We'll check this. I assume we don't catch the error somehow or the proccess doesn't indicate it died failing
Hi JumpyPig73 , can you provide a snippet from the console log? Also, what OS are you running on?
Hi JumpyPig73 , I reproduced the OOM issue but for me it's failing. Are you handling the error in python somehow so the script exists gracefully? otherwise it looks like a regular python exception...
I think that's a hydra issue 🙂 I was able to reproduce this locally. I'll see what can be done
re you running it with an agent (that hydra triggers) ?
you mean clearml-agent? then no, I've been running the process manually up until now
clearml's callback is never called
yeah I suspect that's what might be happening which is why I was inquiring as to how and where exactly in the CML code that happens. Once I know, I can then place breakpoints in the critical regions and debug to see what's going in.
Sorry for the delay CostlyOstrich36 here's the relevant lines from the console:... File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward return F.linear(input, self.weight, self.bias) File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear return torch._C._nn.linear(input, weight, bias) RuntimeError: CUDA out of memory. Tried to allocate 748.00 MiB (GPU 0; 39.59 GiB total capacity; 34.67 GiB already allocated; 584.19 MiB free; 36.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This is how my application crashes when I use a batch size too big for example.
This particular server is on Ubuntu 20.04
Thanks! I'll check for this locally and get back
No, we currently don't handle it gracefully. It just crashes. But we do use hydra which does sort of arrests that exception first. I'm wondering if it's Hydra causing this issue. I'll look into it later today
AnxiousSeal95 I just checked and Hydra returns an exit code of 1
to mark the failure as does another toy program which just throws an exception. So my guess is CML is not using the exit code as a means to determine when the task failed. Are you able to share how CML determines when a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
I didn't check with the toy task, I thought the error codes might be an issue here so was just looking for the difference. I'll check for that too.
But for my hydra task, it's always marked completed, never failed
Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed test (because as mentioned, all exceptions are caught). This should be fixed in the next RC
Yep, I think I see it https://github.com/allegroai/clearml/commit/81de18dbce08229834d9bb0676446a151046e6a7
the state of the Task changes immediately when it crashes ?
I think so. It goes from running to completed immediately on crash
I haven't had much time to look into this but ran a quick debug and it seems like the exception
on the __exit_hook
variable is None
even though the process failed. So seems like hydra maybe somehow preventing the hook callback from executing correctly. will dig in a bit more next week
Thanks for confirming AgitatedDove14 . Do you have an approximate timeline as to when the RC might be out? I'm asking cause I'm going to write a workaround for it tomorrow and I'm wondering if I should just wait for the RC to come out.
Actually you cannot breakpoint at "atexit" calls (or at least doesn't work with my gdb)
But I would add a few prints here:
https://github.com/allegroai/clearml/blob/aa4e5ea7454e8f15b99bb2c77c4599fac2373c9d/clearml/task.py#L3166
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
JumpyPig73 quick question, the state of the Task changes immediately when it crashes ? are you running it with an agent (that hydra triggers) ?
If this is vanilla clearml with Hydra runners, what I suspect happens is Hydra is overriding the signal callback hydra adds (like hydra clearml needs to figure out of the process crashed), then what happens is that clearml's callback is never called (or called without knowing a signal was triggered)
wdyt?
Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
Yes I believe it's hydra too, so just learning how CML determines process status will be really helpful
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
AgitatedDove14 finally had a chance to properly look into it and I think I know what's going on
When running any task with hydra, hydra wraps the called method in its own https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 . When the task throws any exception, it triggers the except
block of this method which handles the exception.
CML marks a task as failed only if the whatever exception the task generated was not handled and the task exited abruptly because it uses the sys.excepthook=self.exc_handler
.
However, in this scenario, since the exceptions is handled by hydra (and it always will be if hydra is used), the exc_handler
method, that's used by CML to determine if there was an exception, is never called because it was attached to the sys.excepthook
,which doesn't get triggered, and therefore CML sees it as no exception.
I think CML needs a better way of determining if there was an exception rather than hoping that the code doesn't catch the exception because in most good production systems, the exits will generally be gracefully handled. I'm modifying my own code base atm to do this. CML will never be able to detect the exception in such cases.
Maybe instead of relying on the system's excepthook, y'all can hook in a method at exit which will look for tracebacks and exception messages to determine if the code has terminated due to some error. Just throwing out an idea off the top of my head.
But generally IMO, CML should have a better approach for detecting errors and updating task statuses correctly.
Hope this helps.
Yeah, it might be the cause...I had a script with OOM and it crashed regularly 🙂
Are these toy programs registered as completed or failed?
Hi JumpyPig73 , I think it was synced to github. You can already test with: git install git+ https://github.com/allegroai/clearml.git