Hi, I'm using ClearML's hosted free SaaS offering. I'm running model training in PyTorch on a server and pushing metrics to CML. I've noticed that anytime my training job fails due to GPU OOM issues, CML marks the job as

Answered

Hi,
I'm using ClearML's hosted free SaaS offering.
I'm running model training in PyTorch on a server and pushing metrics to CML. I've noticed that anytime my training job fails due to GPU OOM issues, CML marks the job as Completed when it should be Failed because the job ended due to an exception.
Is this intentional?

  				
Posted 3 years ago by JumpyClams73


Answers 29

The toy task is marked "Failed".

  				
Posted 3 years ago by JumpyClams73

the state of the Task changes immediately when it crashes?

I think so. It goes from running to completed immediately on crash.

  				
Posted 3 years ago by JumpyClams73

clearml's callback is never called

Yeah, I suspect that's what might be happening, which is why I was asking how and where exactly in the CML code that happens. Once I know, I can place breakpoints in the critical regions and debug to see what's going on.

  				
Posted 3 years ago by JumpyClams73

Sorry for the delay CostlyOstrich36, here's the relevant lines from the console:
...
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 748.00 MiB (GPU 0; 39.59 GiB total capacity; 34.67 GiB already allocated; 584.19 MiB free; 36.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This is how my application crashes when I use a batch size too big, for example.
This particular server is on Ubuntu 20.04

  				
Posted 3 years ago by JumpyClams73

Are these toy programs registered as completed or failed?

  				
Posted 3 years ago by AnxiousSeal95

Thanks! I'll check for this locally and get back

  				
Posted 3 years ago by JumpyClams73

Hi JumpyPig73, can you provide a snippet from the console log? Also, what OS are you running on?

  				
Posted 3 years ago by CostlyOstrich36

I think that's a hydra issue 🙂 I was able to reproduce this locally. I'll see what can be done

  				
Posted 3 years ago by AnxiousSeal95

Thanks for confirming, AgitatedDove14. Do you have an approximate timeline as to when the RC might be out? I'm asking cause I'm going to write a workaround for it tomorrow and I'm wondering if I should just wait for the RC to come out.

  				
Posted 3 years ago by JumpyClams73

Then we can figure out what can be changed so CML correctly registers process failures with Hydra

  				
Posted 3 years ago by JumpyClams73

Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)

  				
Posted 3 years ago by AgitatedDove14

Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed test (because as mentioned, all exceptions are caught). This should be fixed in the next RC

  				
Posted 3 years ago by AgitatedDove14

I didn't check with the toy task, I thought the error codes might be an issue here so was just looking for the difference. I'll check for that too.
But for my hydra task, it's always marked completed, never failed

  				
Posted 3 years ago by JumpyClams73

Yeah, it might be the cause...I had a script with OOM and it crashed regularly 🙂

  				
Posted 3 years ago by AnxiousSeal95

are you running it with an agent (that hydra triggers)?

You mean clearml-agent? Then no, I've been running the process manually up until now.

  				
Posted 3 years ago by JumpyClams73

I haven't had much time to look into this, but I ran a quick debug and it seems like the exception on the __exit_hook variable is None even though the process failed. So it seems like hydra may be somehow preventing the hook callback from executing correctly. Will dig in a bit more next week.

  				
Posted 3 years ago by JumpyClams73

Thanks!

  				
Posted 3 years ago by JumpyClams73

AgitatedDove14 finally had a chance to properly look into it and I think I know what's going on
When running any task with hydra, hydra wraps the called method in its own method ( https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 ). When the task throws any exception, it triggers the except block of this method, which handles the exception.
CML marks a task as failed only if whatever exception the task generated was not handled and the task exited abruptly, because it relies on sys.excepthook=self.exc_handler.
However, in this scenario, since the exception is handled by hydra (and it always will be if hydra is used), the exc_handler method that CML uses to determine whether there was an exception is never called: it is attached to sys.excepthook, which doesn't get triggered, and therefore CML sees no exception.
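
The behaviour described above can be reproduced without Hydra or ClearML. The following is a minimal sketch using only the standard library: the `hook` function is a stand-in for a tracker's handler, and the try/except wrapper only imitates the general pattern of a framework catching the exception and exiting. None of these names come from either project.

```python
import subprocess
import sys

# Child program: installs an excepthook (a stand-in for a tracker's handler),
# then runs a body that is substituted in below.
CHILD_TEMPLATE = '''
import sys

def hook(exc_type, exc, tb):
    print("HOOK CALLED")
    sys.__excepthook__(exc_type, exc, tb)

sys.excepthook = hook

def train():
    raise RuntimeError("simulated OOM")

BODY
'''

# Case 1: the exception escapes to the interpreter.
UNCAUGHT = "train()"

# Case 2: a wrapper handles the exception itself and then exits with 1.
WRAPPED = '''
try:
    train()
except Exception:
    sys.exit(1)
'''


def run_child(body):
    """Run the child and return (exit code, whether the hook fired)."""
    source = CHILD_TEMPLATE.replace("BODY", body)
    proc = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, text=True
    )
    return proc.returncode, "HOOK CALLED" in proc.stdout


if __name__ == "__main__":
    print("uncaught:", run_child(UNCAUGHT))
    print("wrapped: ", run_child(WRAPPED))
```

Both children exit with code 1, but only the uncaught case reaches the hook, because `sys.exit` raises `SystemExit`, which does not go through `sys.excepthook`.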

I think CML needs a better way of determining if there was an exception rather than hoping that the code doesn't catch the exception because in most good production systems, the exits will generally be gracefully handled. I'm modifying my own code base atm to do this. CML will never be able to detect the exception in such cases.
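
One possible shape for such a change in user code, as a sketch only: report the failure explicitly at the point where the exception is still visible, then re-raise. The `report_failure` decorator and the `failures` list are hypothetical names for illustration. With ClearML, the callback could presumably call the task's `mark_failed` method, but that is an assumption and is not exercised here.

```python
import functools


def report_failure(on_failure):
    """Decorator factory: call on_failure(exc) before re-raising any exception."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Tell the tracker first, then let the exception keep propagating
                # so outer wrappers still see it.
                on_failure(exc)
                raise

        return wrapper

    return decorator


# Hypothetical stand-in for a tracker: just collect the exceptions.
failures = []


@report_failure(failures.append)
def train(batch_size):
    if batch_size > 64:
        raise RuntimeError("CUDA out of memory (simulated)")
    return "ok"
```

For this to help, the decorator would need to sit closer to the function than any wrapper that swallows the exception, so that the callback runs before the exception is handled.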

Maybe instead of relying on the system's excepthook, y'all can hook in a method at exit which will look for tracebacks and exception messages to determine if the code has terminated due to some error. Just throwing out an idea off the top of my head.
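
A rough sketch of that idea, assuming it is acceptable to wrap `sys.exit` for the lifetime of the process: record the requested exit code as well as any uncaught exception, and consult both in an `atexit` callback. `ExitStatusRecorder` is a hypothetical name and this is not ClearML's implementation.

```python
import atexit
import sys


class ExitStatusRecorder:
    """Remember how the process asked to terminate."""

    def __init__(self):
        self.exit_code = None
        self.exception = None
        self._orig_exit = sys.exit
        self._orig_excepthook = sys.excepthook

    def install(self):
        sys.exit = self._exit
        sys.excepthook = self._excepthook

    def uninstall(self):
        sys.exit = self._orig_exit
        sys.excepthook = self._orig_excepthook

    def _exit(self, code=None):
        # Record the code, then defer to the real sys.exit (raises SystemExit).
        self.exit_code = code
        self._orig_exit(code)

    def _excepthook(self, exc_type, exc, tb):
        # Reached only for exceptions that nobody handled.
        self.exception = exc
        self._orig_excepthook(exc_type, exc, tb)

    def failed(self):
        if self.exception is not None:
            return True
        return self.exit_code not in (None, 0)


if __name__ == "__main__":
    recorder = ExitStatusRecorder()
    recorder.install()
    atexit.register(lambda: print("run failed:", recorder.failed()))
    try:
        raise RuntimeError("simulated crash")
    except RuntimeError:
        # Imitates a wrapper that handles the error and exits non-zero.
        sys.exit(1)
```

This only catches exits that go through `sys.exit`; a process that ends via `os._exit` or a signal would bypass it.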

But generally IMO, CML should have a better approach for detecting errors and updating task statuses correctly.

Hope this helps.

  				
Posted 3 years ago by JumpyClams73

Then we can figure out what can be changed so CML correctly registers process failures with Hydra

JumpyPig73 quick question, the state of the Task changes immediately when it crashes? Are you running it with an agent (that hydra triggers)?

If this is vanilla clearml with Hydra runners, what I suspect happens is that Hydra is overriding the signal callback clearml adds (like hydra, clearml needs to figure out if the process crashed), so clearml's callback is never called (or is called without knowing a signal was triggered)

wdyt?

  				
Posted 3 years ago by AgitatedDove14

AnxiousSeal95 I just checked and Hydra returns an exit code of 1 to mark the failure, as does another toy program which just throws an exception. So my guess is that CML is not using the exit code to determine when the task failed. Are you able to share how CML determines when a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
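
For reference, this kind of check can be scripted. The sketch below runs a command as a child process and classifies the run purely by its exit code. `run_and_classify` is a hypothetical helper and says nothing about how CML itself decides.

```python
import subprocess
import sys


def run_and_classify(argv):
    """Run a command and map its exit code to a status string."""
    proc = subprocess.run(argv)
    return "completed" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # A toy program that throws, one that exits with 1, and one that succeeds.
    toys = [
        "raise RuntimeError('boom')",
        "import sys; sys.exit(1)",
        "pass",
    ]
    for code in toys:
        print(repr(code), "->", run_and_classify([sys.executable, "-c", code]))
```

A supervisor like this sees the same non-zero code whether the exception was uncaught or was handled by a wrapper that then exits with 1.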

  				
Posted 3 years ago by JumpyClams73

Actually you cannot breakpoint at "atexit" calls (or at least doesn't work with my gdb)
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166

  				
Posted 3 years ago by AgitatedDove14

👍 Let me know if it solved the issue 🙂

  				
Posted 3 years ago by AgitatedDove14

Hi JumpyPig73, I think it was synced to github. You can already test with: pip install git+https://github.com/allegroai/clearml.git

  				
Posted 3 years ago by AgitatedDove14

No, we currently don't handle it gracefully. It just crashes. But we do use hydra, which does sort of arrest that exception first. I'm wondering if it's Hydra causing this issue. I'll look into it later today.

  				
Posted 3 years ago by JumpyClams73

We'll check this. I assume we don't catch the error somehow, or the process doesn't indicate that it died failing.

  				
Posted 3 years ago by AnxiousSeal95

Yep, I think I see it https://github.com/allegroai/clearml/commit/81de18dbce08229834d9bb0676446a151046e6a7

  				
Posted 3 years ago by JumpyClams73

Yes I believe it's hydra too, so just learning how CML determines process status will be really helpful

  				
Posted 3 years ago by JumpyClams73

It did indeed. Thanks!

  				
Posted 3 years ago by JumpyClams73

Hi JumpyPig73, I reproduced the OOM issue but for me it's failing. Are you handling the error in Python somehow so the script exits gracefully? Otherwise it looks like a regular Python exception...

  				
Posted 3 years ago by AnxiousSeal95
