
No, we currently don't handle it gracefully. It just crashes. But we do use Hydra, which sort of intercepts that exception first. I'm wondering if it's Hydra causing this issue. I'll look into it later today.
AnxiousSeal95 I just checked and Hydra returns an exit code of 1 to mark the failure, as does another toy program which just throws an exception. So my guess is CML is not using the exit code as a means to determine when the task failed. Are you able to share how CML determines when a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
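For reference, the exit-code check described above can be reproduced with a small sketch like the one below. It is only an illustration: `TOY_PROGRAM` and `run_toy` are hypothetical names, and the toy program is a stand-in, not the actual Hydra task.

```python
import subprocess
import sys

# Hypothetical toy program: it does nothing except raise an exception.
TOY_PROGRAM = "raise RuntimeError('toy failure')"


def run_toy(source: str) -> int:
    """Run `source` in a fresh Python interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep the child's traceback out of our own output
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    # An uncaught exception makes the interpreter exit with status 1.
    print("toy program exit code:", run_toy(TOY_PROGRAM))
    # A program that finishes normally exits with status 0.
    print("clean program exit code:", run_toy("pass"))
```

So a process that dies from an uncaught exception is distinguishable from a clean run by its exit status alone, which is why the exit code seemed like the natural thing to compare.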
I didn't check with the toy task, I thought the error codes might be an issue here so was just looking for the difference. I'll check for that too.
But for my hydra task, it's always marked completed, never failed
Yes I believe it's hydra too, so just learning how CML determines process status will be really helpful
clearml's callback is never called
Yeah, I suspect that's what might be happening, which is why I was asking how and where exactly in the CML code that happens. Once I know, I can place breakpoints in the critical regions and debug to see what's going on.
AgitatedDove14 finally had a chance to properly look into it and I think I know what's going on
When running any task with hydra, hydra wraps the called method in its own method ( https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 ). When the task throws any exception, it triggers the except block of this method, which handles the exception.
CML marks a task as failed only if whatever exception the task generated was not ha...
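A minimal sketch of that mechanism, using only the standard library. Everything here is an assumption made for illustration: `wrapper` is a simplified stand-in for Hydra's wrapping method, and `hook` is a stand-in for an uncaught-exception hook; neither is the actual Hydra or CML code.

```python
import subprocess
import sys
import textwrap

# Child script: installs an uncaught-exception hook, then runs a failing task
# either directly ("bare") or inside a wrapper that catches the exception and
# exits with status 1 ("wrapped").
CHILD = textwrap.dedent(
    """
    import atexit
    import sys

    seen = {"exception": None}

    def hook(exc_type, exc, tb):
        # Only called for exceptions that reach the top level uncaught.
        seen["exception"] = exc

    sys.excepthook = hook
    atexit.register(lambda: print("hook saw:", repr(seen["exception"])))

    def task():
        raise ValueError("boom")

    def wrapper(func):
        # Simplified stand-in: handle the exception, then exit non-zero.
        try:
            func()
        except Exception:
            sys.exit(1)

    if sys.argv[1] == "wrapped":
        wrapper(task)
    else:
        task()
    """
)


def run_child(mode: str):
    """Run the child script in `mode` and return (exit code, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode], capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    for mode in ("bare", "wrapped"):
        code, out = run_child(mode)
        print(f"{mode}: exit code {code}, {out}")
```

Both runs exit with status 1, but in the wrapped run the hook never receives the exception, because it was already handled before the process exited. If failure detection relies on such a hook rather than on the exit code, the wrapped run would look like a normal completion.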
I haven't had much time to look into this, but I ran a quick debug and it seems like the exception on the __exit_hook variable is None even though the process failed. So it seems like Hydra may somehow be preventing the hook callback from executing correctly. Will dig in a bit more next week.
Thanks for confirming AgitatedDove14 . Do you have an approximate timeline as to when the RC might be out? I'm asking cause I'm going to write a workaround for it tomorrow and I'm wondering if I should just wait for the RC to come out.
no problem. Thanks for the information Erez!
This is great! Thanks!
If I have access to the logs, python env and git commits, is there an API to log those to the experiments too?
This is great! Thanks for the example Martin, much appreciated!
Thanks! I'll check for this locally and get back
(the one created when you executed the code on your laptop
I haven't executed the task myself at all. I just cloned it from the examples that are available in the SaaS console upon account creation, specifically the hyper-parameters example under the ClearML Examples project.
I tried using 1.2.0rc1 but it doesn't work as expected. We have a bunch of options for fire in the entrypoint, but whichever one I enter on the command line, fire still just executes the first command that was defined in my dictionary under fire.Fire({...}). It does, however, route to the correct command if I use 1.1.6, which tells me that this is being caused by some issue with 1.2.0rc1.
I'm looking at the docs on docker mode and running the script. Is this script run after the venv and code dir are setup, or immediately after the container starts but before the environment for running the experiment is setup?
Thanks! Do you have a public bug tracker? If yes, are you able to share the issue number so I can follow it?
I need to put it into my code, so will be eagerly waiting for the fix
Ok. I think I misunderstood what you said. I thought you meant you've already opened a bug ticket. If that's not the case, do you want me to create one on GitHub?
Yes, but is it run after the requirements are installed and the code is mounted? The docs say: "If we look at the console output in the web UI, the third entry should start with Executing: ['docker', 'run', '-t', '--gpus...', and towards the end of the entry, where the downloaded packages are mentioned, we can see the additional shell-script apt-get install -y bindfs." That seems to suggest that would be the case, but I'm not sure what the 1st or 2nd entries are, so I want to confirm.
the CML free SaaS offering. It'll probably hit https://app.clear.ml/api if I'm not wrong
I think there's some confusion here. I'm not running the server. My metrics are getting logged to the CML cloud.
The Agent pulls the Task, and then reproduces it, and now it will execute the extra_docker_shell_script that was put in the configuration file.
Does this imply the former? Env is fully setup, then script is run, then experiment is started by calling the executable?
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
Yep, I think I see it https://github.com/allegroai/clearml/commit/81de18dbce08229834d9bb0676446a151046e6a7