
I haven't had much time to look into this, but I ran a quick debug and it seems like the exception on the `__exit_hook` variable is None even though the process failed. So it seems like Hydra may somehow be preventing the hook callback from executing correctly. Will dig in a bit more next week.
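To illustrate the hypothesis: if Hydra catches the task's exception and exits via `sys.exit`, an interpreter-level exception hook never sees it, even though the process still fails. A minimal sketch (not ClearML's actual implementation):

```python
import sys

def exit_hook(exc_type, exc_value, tb):
    # Only runs when an exception propagates out of the program uncaught.
    print(f"exit_hook saw: {exc_type.__name__}: {exc_value}")

sys.excepthook = exit_hook

def task_function():
    raise RuntimeError("training crashed")

try:
    # Stand-in for a framework (e.g. Hydra) wrapping the task function.
    task_function()
except Exception:
    # The framework swallows the exception and exits non-zero instead.
    # sys.excepthook never fires, so a hook-based monitor records no
    # exception even though the process exit code is 1.
    sys.exit(1)
```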
so there's no way to do that when running in pip or conda mode?
No, we currently don't handle it gracefully. It just crashes. But we do use Hydra, which sort of intercepts that exception first. I'm wondering if it's Hydra causing this issue. I'll look into it later today.
the CML free SaaS offering. It'll probably hit https://app.clear.ml/api if I'm not wrong
Thanks! I'll check for this locally and get back
AnxiousSeal95 I just checked, and Hydra returns an exit code of 1 to mark the failure, as does another toy program which just throws an exception. So my guess is CML is not using the exit code as a means to determine when the task failed. Are you able to share how CML determines when a task failed? If you could point me to the relevant code files, I'm happy to dive in and figure it out.
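For the record, the check was along these lines (a throwaway sketch; the script names are placeholders):

```python
import subprocess
import sys

# Compare the exit codes of the Hydra app and a plain crashing script;
# both file names here are placeholders.
for script in ("hydra_app.py", "toy_crash.py"):
    result = subprocess.run([sys.executable, script])
    print(f"{script}: exit code {result.returncode}")
```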
I'm looking at the docs on docker mode and running the script. Is this script run after the venv and code dir are set up, or immediately after the container starts but before the environment for running the experiment is set up?
The state of the Task changes immediately when it crashes?
I think so. It goes from running to completed immediately on crash
This is great! Thanks!
If I have access to the logs, python env and git commits, is there an API to log those to the experiments too?
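If it's useful, a sketch of how backfilling those might look with the standard SDK calls (the paths and values are placeholders):

```python
from clearml import Task

task = Task.init(project_name="imported-runs", task_name="backfilled-run")

# Attach the raw console log as an artifact (path is a placeholder).
task.upload_artifact(name="console_log", artifact_object="logs/run_1234.log")

# Record the environment and commit as task parameters.
task.set_parameters({"git/commit": "abc123", "env/python": "3.9.7"})

# Replay any scalars that are still available.
logger = task.get_logger()
logger.report_scalar(title="loss", series="train", value=0.42, iteration=100)

task.close()
```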
Ok. I think I misunderstood what you said. I thought you meant you'd already opened a bug ticket. If that's not the case, do you want me to create one on GitHub?
We have run experiments in the past (before I put ClearML into my code) which logged scalars, plots, etc. to local TensorBoard. Is there any way to import this data to ClearML cloud for tracking, visualization, and comparison?
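One possible approach, sketched below, is to read the event files with TensorBoard's `EventAccumulator` and replay the scalars into a new task (the directory path is a placeholder):

```python
from clearml import Task
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

task = Task.init(project_name="imported-runs", task_name="old-tb-run")
logger = task.get_logger()

# Load an existing TensorBoard event-file directory (path is a placeholder).
acc = EventAccumulator("runs/experiment_1")
acc.Reload()

# Replay every recorded scalar series into the ClearML task.
for tag in acc.Tags()["scalars"]:
    for event in acc.Scalars(tag):
        logger.report_scalar(title=tag, series=tag,
                             value=event.value, iteration=event.step)

task.close()
```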
Yes, but is it run after the requirements are installed and the code is mounted? The docs say: "If we look at the console output in the web UI, the third entry should start with `Executing: ['docker', 'run', '-t', '--gpus...']`, and towards the end of the entry, where the downloaded packages are mentioned, we can see the additional shell-script `apt-get install -y bindfs`." That suggests it is the case, but I'm not sure what the 1st or 2nd entries are, so I want to confirm.
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
Yes, it seems like the command-line args are recorded now, but the `connect` call with my parameter dictionary now fails with this exception:
```
Error executing job with overrides: ['model_name=all-test', ...]
Traceback (most recent call last):
  File "/home/binoydalal/miniconda3/envs/DS974/lib/python3.9/site-packages/clearml/binding/hydra_bind.py", line 146, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  ....
  File "/home/binoydalal/miniconda3/envs/DS974/li...
```
Aah I see, it only says `Image`. Somehow I hit tunnel vision on "Base Docker Image" as stated in the docs and couldn't identify both to mean the same thing 😅 Thanks!
"The Agent pulls the Task, and then reproduces it, and now it will execute the `extra_docker_shell_script` that was put in the configuration file."
Does this imply the former? Env is fully set up, then the script is run, then the experiment is started by calling the executable?
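For reference, that shell script is configured in `clearml.conf` under the agent section; based on the docs' bindfs example, it would look roughly like this:

```
agent {
    # Shell commands the agent injects into the container startup
    # (the bindfs example quoted from the docs).
    extra_docker_shell_script: ["apt-get install -y bindfs"]
}
```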
Thanks! I'll give the RC a shot.
I think the fire + hydra combination is not an issue anymore. We're going to separate the two out; I tried it last night, and argument modification and passing worked fine with hydra only.
In any case, thanks for your help, Martin!
Do you want me to try running it manually?
I thought the agent created a new conda env and installed all the packages recorded during the initial task run from scratch (except for caching with venv). Is that not the case?
For hydra-core:
```
...
- humanfriendly==10.0
- hydra==2.5
- idna==3.3
...
```
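Note that the snapshot records `hydra==2.5` rather than `hydra-core`, so the agent would try to install the wrong PyPI package. One possible workaround is a sketch using the SDK's `Task.add_requirements` to force the correct requirement before `Task.init` (the version pin is a placeholder):

```python
from clearml import Task

# Force the correct package into the recorded requirements; must be
# called before Task.init(). The version pin is a placeholder.
Task.add_requirements("hydra-core", "1.1.1")
task = Task.init(project_name="debug", task_name="hydra-requirements")
```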
Will try this. Thanks for promptly looking into this. Much appreciated!
Could it be the script itself is using vanilla `sys.argv` and not argparse? (edited)
Thanks for bringing this up. Our code uses fire to parse command-line args and then sort of hands off to hydra, so yes, it does use `sys.argv` initially. Is this a possible issue?
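For illustration, this is the kind of entry point being described: fire reads `sys.argv` directly instead of going through `argparse`, which is what ClearML's argument binding normally hooks. A minimal hypothetical sketch:

```python
import sys
import fire

def train(model_name: str = "baseline"):
    print(f"training {model_name}")

if __name__ == "__main__":
    # fire consumes sys.argv directly, e.g.
    #   python train.py --model_name=all-test
    # so nothing passes through an argparse.ArgumentParser.
    print(f"raw argv: {sys.argv[1:]}")
    fire.Fire(train)
```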
I'm queuing the task to my laptop by cloning on the web console. I have my agent set up to use conda as the primary package manager.
This is great! Thanks for the example Martin, much appreciated!
I didn't check with the toy task; I thought the error codes might be an issue here, so I was just looking for the difference. I'll check for that too.
But for my hydra task, it's always marked completed, never failed
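As a stopgap until the root cause is found, the task could be flagged explicitly from an exception handler; a sketch using the SDK's `mark_failed` (the entry point is a placeholder):

```python
from clearml import Task

def run_training():
    # Placeholder for the real Hydra entry point.
    raise RuntimeError("simulated crash")

task = Task.init(project_name="debug", task_name="hydra-task")

try:
    run_training()
except Exception as exc:
    # Explicitly flag the task so a crash is not left as "completed".
    task.mark_failed(status_reason=str(exc))
    raise
```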