Answered

Hi,
I'm using ClearML's hosted free SaaS offering.
I'm running model training in PyTorch on a server and pushing metrics to CML. I've noticed that anytime my training job fails due to GPU OOM issues, CML marks the job as Completed when it should be Failed because the job ended due to an exception.
Is this intentional?
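For reference, the setup is roughly along these lines (the project/task names and sizes are illustrative, not my actual training code):

# Minimal sketch of the setup; the oversized batch is only there to force a CUDA OOM.
import torch
from clearml import Task

task = Task.init(project_name="oom-repro", task_name="oom-demo")

model = torch.nn.Linear(4096, 4096).cuda()
batch = torch.randn(5_000_000, 4096, device="cuda")  # deliberately far too large for the GPU
out = model(batch)  # raises RuntimeError: CUDA out of memory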

  
  
Posted 2 years ago

Answers 29


Thanks!

  
  
Posted 2 years ago

Hi JumpyPig73 , I reproduced the OOM issue, but for me it's failing. Are you handling the error in Python somehow so the script exits gracefully? Otherwise it looks like a regular Python exception...

  
  
Posted 2 years ago

Hi JumpyPig73 , can you provide a snippet from the console log? Also, what OS are you running on?

  
  
Posted 2 years ago

Sorry for the delay CostlyOstrich36 , here are the relevant lines from the console:

...
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 748.00 MiB (GPU 0; 39.59 GiB total capacity; 34.67 GiB already allocated; 584.19 MiB free; 36.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is how my application crashes when I use a batch size that is too big, for example.
This particular server is on Ubuntu 20.04.

  
  
Posted 2 years ago

AnxiousSeal95 I just checked and Hydra returns an exit code of 1 to mark the failure as does another toy program which just throws an exception. So my guess is CML is not using the exit code as a means to determine when the task failed. Are you able to share how CML determines when a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
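(For reference, I checked the exit codes roughly like this; train.py is just a placeholder for the entry point:)

# Run the entry point as a child process and inspect the exit code it reports.
import subprocess, sys

result = subprocess.run([sys.executable, "train.py"])
print("exit code:", result.returncode)  # 1 after an unhandled exception, 0 on success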

  
  
Posted 2 years ago

No, we currently don't handle it gracefully; it just crashes. But we do use Hydra, which sort of intercepts that exception first. I'm wondering if Hydra is causing this issue. I'll look into it later today.
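Our entry point looks roughly like this (simplified; the config and function names are placeholders):

# Simplified sketch of our Hydra + ClearML entry point; names are placeholders.
import hydra
from clearml import Task

def train(cfg):
    ...  # the actual PyTorch training loop (not shown); this is where the OOM is raised

@hydra.main(config_path="conf", config_name="config")
def main(cfg):
    task = Task.init(project_name="my-project", task_name="train")
    train(cfg)

if __name__ == "__main__":
    main()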

  
  
Posted 2 years ago

We'll check this. I assume we somehow don't catch the error, or the process doesn't indicate that it exited with a failure.

  
  
Posted 2 years ago

Yeah, it might be the cause...I had a script with OOM and it crashed regularly 🙂

  
  
Posted 2 years ago

Are these toy programs registered as completed or failed?

  
  
Posted 2 years ago

Then we can figure out what can be changed so CML correctly registers process failures with Hydra

JumpyPig73 quick question: does the state of the Task change immediately when it crashes? Are you running it with an agent (that Hydra triggers)?

If this is vanilla ClearML with Hydra runners, what I suspect happens is that Hydra is overriding the signal callback ClearML adds (like Hydra, ClearML needs to figure out whether the process crashed). The result is that ClearML's callback is never called (or is called without knowing a signal was triggered).
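Something like the following is the failure mode I mean, using sys.excepthook as a stand-in for the callback (a standalone illustration, not actual ClearML/Hydra code):

# Standalone illustration of one hook silently replacing another; not ClearML or Hydra code.
import sys

def first_hook(exc_type, exc_value, exc_tb):
    print("first hook: would mark the task as Failed here")

def second_hook(exc_type, exc_value, exc_tb):
    print("second hook: installed later, without chaining to the first")

sys.excepthook = first_hook    # e.g. set by one library at init time
sys.excepthook = second_hook   # later overwritten by another library

raise RuntimeError("boom")     # only "second hook" is printed; the first never runs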

wdyt?

  
  
Posted 2 years ago

Are you running it with an agent (that Hydra triggers)?

You mean clearml-agent? Then no, I've been running the process manually up until now.

  
  
Posted 2 years ago

I didn't check with the toy task; I thought the error codes might be an issue here, so I was just looking for the difference. I'll check that too.
But my Hydra task is always marked Completed, never Failed.

  
  
Posted 2 years ago

does the state of the Task change immediately when it crashes?

I think so. It goes from Running to Completed immediately on crash.

  
  
Posted 2 years ago

I think that's a hydra issue 🙂 I was able to reproduce this locally. I'll see what can be done

  
  
Posted 2 years ago

Yes I believe it's hydra too, so just learning how CML determines process status will be really helpful

  
  
Posted 2 years ago

Then we can figure out what can be changed so CML correctly registers process failures with Hydra

  
  
Posted 2 years ago

The toy task is marked "Failed".

  
  
Posted 2 years ago

AgitatedDove14 finally had a chance to properly look into it and I think I know what's going on
When running any task with Hydra, Hydra wraps the called method in its own wrapper: https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 . When the task throws any exception, it triggers the except block of this method, which handles the exception.
CML marks a task as Failed only if whatever exception the task generated was not handled and the task exited abruptly, because it relies on sys.excepthook = self.exc_handler.
However, in this scenario, since the exception is handled by Hydra (and it always will be if Hydra is used), the exc_handler method that CML uses to determine whether there was an exception is never called: it is attached to sys.excepthook, which never gets triggered, so CML sees no exception.
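You can see the mechanism in isolation with a few lines (a standalone illustration, not the actual CML or Hydra code):

# Standalone illustration: sys.excepthook only fires for *unhandled* exceptions,
# so once a wrapper catches the error, the hook-based detector sees nothing.
import sys

def exc_handler(exc_type, exc_value, exc_tb):
    print("excepthook called -> the task would be marked Failed")

sys.excepthook = exc_handler      # roughly what CML relies on

def task_main():
    raise RuntimeError("CUDA out of memory")   # the training failure

try:                              # roughly what Hydra's wrapper does around the task
    task_main()
except Exception as err:
    print(f"wrapper handled the error itself: {err}")
# exc_handler never runs and the process exits "cleanly" -> the task shows as Completed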

I think CML needs a better way of determining whether there was an exception, rather than hoping that the code doesn't catch it, because in most good production systems exits will generally be handled gracefully, and CML will never be able to detect the exception in such cases. I'm modifying my own code base at the moment to do exactly that.
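(The workaround I'm writing is roughly the following; I'm assuming Task.mark_failed() is available in the SDK version I'm on:)

# Rough sketch of the workaround in my own code base; assumes the SDK exposes
# Task.mark_failed() (check your clearml version).
from clearml import Task

def train():
    ...  # the actual PyTorch training loop (not shown)

def run_training():
    task = Task.init(project_name="my-project", task_name="train")
    try:
        train()
    except Exception as err:
        # Flip the status explicitly before Hydra (or anything else) swallows the error.
        task.mark_failed(status_reason=type(err).__name__, status_message=str(err))
        raise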

Maybe instead of relying on the system's excepthook, you could hook in a method at exit that looks for tracebacks and exception messages to determine whether the code terminated due to an error. Just throwing out an idea off the top of my head.
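Something along these lines, just as a sketch of the idea (the helper names are made up):

# Sketch of the "inspect output at exit" idea: tee stderr, then at exit look for a traceback.
import atexit, io, sys

class TeeStderr(io.TextIOBase):
    """Write-through wrapper that keeps a copy of everything sent to stderr."""
    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.captured = io.StringIO()
    def write(self, text):
        self.captured.write(text)
        return self._wrapped.write(text)
    def flush(self):
        self._wrapped.flush()

_tee = TeeStderr(sys.stderr)
sys.stderr = _tee

def report_status_at_exit():
    # Heuristic: if a traceback was printed, treat the run as failed, even if a
    # framework (e.g. Hydra) caught the exception and reported it itself.
    if "Traceback (most recent call last):" in _tee.captured.getvalue():
        print("run ended with an error -> the task should be marked Failed")
    else:
        print("run ended cleanly")

atexit.register(report_status_at_exit)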

But generally IMO, CML should have a better approach for detecting errors and updating task statuses correctly.

Hope this helps.

  
  
Posted 2 years ago

I haven't had much time to look into this, but I ran a quick debug and it seems like the exception on the __exit_hook variable is None even though the process failed. So it seems like Hydra may somehow be preventing the hook callback from executing correctly. Will dig in a bit more next week.

  
  
Posted 2 years ago

Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that, as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed task (because, as mentioned, all exceptions are caught). This should be fixed in the next RC.

  
  
Posted 2 years ago

clearml's callback is never called

Yeah, I suspect that's what might be happening, which is why I was inquiring as to how and where exactly in the CML code that happens. Once I know, I can place breakpoints in the critical regions and debug to see what's going on.
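(What I had in mind is roughly this kind of probe dropped into my entry point, just to see whether the installed hook ever fires:)

# Rough probe: wrap whatever excepthook is currently installed so we can see if it fires.
import sys

_original_hook = sys.excepthook
print("installed excepthook:", _original_hook)

def _logging_hook(exc_type, exc_value, exc_tb):
    print(f"excepthook fired: {exc_type.__name__}: {exc_value}")
    _original_hook(exc_type, exc_value, exc_tb)

sys.excepthook = _logging_hook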

  
  
Posted 2 years ago

Actually you cannot breakpoint at "atexit" calls (or at least it doesn't work with my gdb).
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166

  
  
Posted 2 years ago

Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)

  
  
Posted 2 years ago

Thanks! I'll check for this locally and get back

  
  
Posted 2 years ago

Thanks for confirming AgitatedDove14 . Do you have an approximate timeline for when the RC might be out? I'm asking because I'm going to write a workaround tomorrow, and I'm wondering if I should just wait for the RC instead.

  
  
Posted 2 years ago

Hi JumpyPig73 , I think it was synced to GitHub. You can already test with: pip install git+https://github.com/allegroai/clearml.git

  
  
Posted 2 years ago

👍 Let me know if it solved the issue 🙂

  
  
Posted 2 years ago

It did indeed. Thanks!

  
  
Posted 2 years ago