StickyBlackbird93 the agent is supposed to resolve the correct version of pytorch based on the CUDA in the container. Sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker mode, or inside a docker container?
Yep, I set this env variable and it just doesn't help. ClearML still installs torch for the wrong platform (arm). The problem was only resolved after I pinned the particular pytorch wheels in the requirements in the repo.
Moreover, if I set the wheels in the UI, ClearML still installs the wrong package.
Seems like a bug.
I'm running agent inside docker.
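For reference, the workaround described above (pinning an exact x86_64 pytorch wheel in the repo's requirements file) might look like this sketch; the index URL, CUDA tag, and version are assumptions and should be matched to the CUDA version in your container:

```shell
# Hypothetical example: pin an exact CUDA 11.5 x86_64 torch build in
# requirements.txt so the agent cannot pick an aarch64 wheel
cat >> requirements.txt <<'EOF'
--extra-index-url https://download.pytorch.org/whl/cu115
torch==1.11.0+cu115
EOF
```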
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
Hi StickyBlackbird93
Yes, this agent version is rather old (clearml_agent v1.0.0); it had a bug where a pytorch aarch64 wheel broke the agent (by default the agent in docker mode will use the latest stable version, but not in venv mode)
Basically, upgrading to the latest clearml-agent version should solve the issue: pip3 install -U clearml-agent==1.2.3
BTW for future debugging, this is the interesting part of the log (notice it is looking for the correct pytorch based on the auto-detected CUDA version, 11.5).
Then it failed because it found an aarch64 wheel instead of x86_64 (this is the bug that was fixed in the latest version):
1654011488836 sjc13-t04-mlt02:!6e9:gpu0 DEBUG Torch CUDA 115 download page found
Found PyTorch version torch==1.11.0 matching CUDA version 115
ERROR: torch-1.11.0-cp39-cp39-manylinux2014_aarch64.whl is not a supported wheel on this platform.
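When debugging this kind of mismatch, a quick sanity check is to print the CPU architecture inside the container and compare it against the wheel tag the agent resolved (this is a generic check, not a clearml-agent command):

```shell
# The machine architecture must match the resolved wheel's platform tag:
# expect x86_64 here, while the failing wheel above was tagged aarch64
python3 -c "import platform; print(platform.machine())"
```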
Hi Danil,
You can use the following env variable to set it 🙂 CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
AgitatedDove14 the agent doesn't even try to resolve this conflict, it directly installs the wrong version of pytorch. I'm running the agent inside docker.