Reputation
Badges 1
57 × Eureka!clearml's callback is never called
yeah I suspect that's what might be happening which is why I was inquiring as to how and where exactly in the CML code that happens. Once I know, I can then place breakpoints in the critical regions and debug to see what's going in.
Thanks for confirming AgitatedDove14 . Do you have an approximate timeline as to when the RC might be out? I'm asking cause I'm going to write a workaround for it tomorrow and I'm wondering if I should just wait for the RC to come out.
Could it be hydra was installed on your laptop via conda not pip?
Yes, while we do use a conda env, our packages are installed using pip
. That being said, I have hydra-core==1.1.1
in my local dependencies as well.
Ok. I think I misunderstood what you said. I thought you meant you've already opened a bug ticket. If that's not the case, do you want to me create one on github?
Sorry if I sounded curt. Didn't mean to. To clarify, I've created my account using Google SSO on http://app.clear.ml , and am currently on the Free tier. I am pushing all my data onto CML's servers. This error happens when I try to query those servers for the metrics and variants for a particular task of mine.
I'm signed up for Pro. Is there some restricted docs site for pro users CostlyOstrich36 ?
The Agent pulls the Task, and then reproduces it, and now it will execute the extra_docker_shell_script that was put in the configuration file.
Does this imply the former? Env is fully setup, then script is run, then experiment is started by calling the executable?
I tried using 1.2.0rc1
but it doesn't work as expected. We have a bunch of options for fire in the entrypoint, but irrespective of whichever I enter on the command line, fire still just executes the first command that was defined in my dictionary under fire.Fire({...})
. It however routes to the correct command if I use 1.1.6
which tells me that this is being caused by some issue with 1.2.0rc1
OS - Ubuntu 20.04
Conda - 4.10.3
The agent is running in a conda env with python==3.9.7
Is this the info you were looking for?
re you running it with an agent (that hydra triggers) ?
you mean clearml-agent? then no, I've been running the process manually up until now
I thought the agent created a new conda env and installed all packages, recorded during initial task run, from scratch (except for caching with venv). Is that not the case?
so there's no way to do that when running in pip or conda mode?
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
I haven't had much time to look into this but ran a quick debug and it seems like the exception
on the __exit_hook
variable is None
even though the process failed. So seems like hydra maybe somehow preventing the hook callback from executing correctly. will dig in a bit more next week
Thanks! I'll check for this locally and get back
Thanks for getting back Martin. The hydra example fails when i try to queue it to my local withStarting Task Execution: Traceback (most recent call last): File "hydra_example.py", line 10, in <module> @hydra.main(config_path="config_files", config_name="config") AttributeError: module 'hydra' has no attribute 'main'
Sorry for the delay CostlyOstrich36 here's the relevant lines from the console:
` ...
File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/home/binoyloaner/miniconda3/envs/DS974/lib/python3....
the state of the Task changes immediately when it crashes ?
I think so. It goes from running to completed immediately on crash
I think there's some confusion here. I'm not running the server. My metrics are getting logged to the CML cloud.
No, we currently don't handle it gracefully. It just crashes. But we do use hydra which does sort of arrests that exception first. I'm wondering if it's Hydra causing this issue. I'll look into it later today
Got it. Thanks for clearing that up!
I think the fire + hydra combination is not an issue anymore. We're going to separate the 2 out, and I tried it last night and argument modification and passing worked fine with hydra only.
In any case, thanks for you help Martin!
Will try this. Thanks for promptly looking into this. Much appreciated!
Would you happen to have a timeline for when the feature might become available?
AgitatedDove14 finally had a chance to properly look into it and I think I know what's going on
When running any task with hydra, hydra wraps the called method in its own https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 . When the task throws any exception, it triggers the except
block of this method which handles the exception.
CML marks a task as failed only if the whatever exception the task generated was not ha...