clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
I see, so it is actually not related to clearml 🎉
The agent is run with pip. However, the docker image uses conda (because NVIDIA uses conda to build PyTorch most probably). My theory is that when the task is run the first time on an agent, Task.init will update the requirements. Then when ran a second time, the task will contain the requirements of the (conda-) environment from the first run.
First one is the original, second one the clone
preinstalled in the environment (e.g. nvidia docker). These packages may not be available via pip, so the run will fail.
Okay that's the part that I'm missing, how come in the first run the package existed and in the cloned Task they are missing? I'm assuming agents are configured basically the same (i.e. docker mode with the same network access). What did I miss here ?
For example I run a task remotely. Now I decide I want to rerun it, but slightly change a parameter. So I clone the task and edit the parameter in the WebUI. Then I submit the task to a queue. When the clearml-agent pulls the tasks and tries to install the requirements, it will fail since the task requirements now contain packages that had been preinstalled in the environment (e.g. nvidia docker). These packages may not be available via pip, so the run will fail.
Hi ReassuredTiger98
Could you send the log of both run ?
(I'm not sure this is a bug, or some misconfiguration , but the scenario should have worked...)
In the first run the package only existed because it is preinstalled in the docker image. Afaik, in the second run it is also preinstalled, but pip will first try to resolve it and then see whether it already exists. But I am not to sure about this.
ReassuredTiger98 both are running with pip as package manager, I thought you mentioned conda as package manager, no?agent.package_manager.type = pip
Also the failed execution is looking for "ruamel_yaml_conda" but it is nowhere to be found on the original one?! how is that possible ?
Then when ran a second time, the task will contain the requirements of the (conda-) environment from the first run.
What you see in the log under "Summary - installed python packages:" will be exactly what is updated on the Task. But it does not contain the "ruamel_yaml_conda" package, this is what I cannot get...
But I did find this part:ERROR: conda 4.10.1 requires ruamel_yaml_conda>=0.11.14, which is not installed.
Which point to conda needing this package and then failing to install it the next time.
And lo and behold what I just found:
https://github.com/conda/conda/issues/10691
Hi ReassuredTiger98
When clearml is running inside the docker the installed packages of the WebUI get updated.
Yes, this is by design, so the agent can always reproduce the exact python environment.
(internal the original requirements is also stored, but not available in the UI).
What exactly is the use case here ? wouldn't make sense to reproduce the entire working environment when you clone the executed Task ?
clearml will register preinstalled conda packages as requirements.
For example I get the following error if I simply clone and rerun:ERROR: Could not find a version that satisfies the requirement ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28)) (from versions: none) ERROR: No matching distribution found for ruamel_yaml_conda>=0.11.14 (from conda==4.10.1->-r /tmp/cached-reqs6wtc73be.txt (line 28))
clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
Yes mixing conda & pip is not supported by clearml (or conda or pip for that matter)
Even python package numbers might not exist on both.
We could add a flag not to update back the pip freeze, it's an easy feature to add. I'm just wondering on the exact use case