SubstantialElk6, it seems the automatic resolution of the PyTorch CUDA version failed.
What do you have in the "installed packages" section?
Oh, SubstantialElk6, please use agent RC3 (the latest RC is somewhat broken, sorry; we will pull it).
I agree with Martin.B, it appears to be a CUDA mismatch. The torch version is trying to use CUDA 10.2, but you have
agent.default_docker.image = nvidia/cuda:10.1-runtime-ubuntu18.04
That should probably be
agent.default_docker.image = nvidia/cuda:10.2-runtime-ubuntu18.04
AlertBlackbird30, actually the log says 10.2:
docker_cmd = nvidia/cuda:10.2-devel-ubuntu18.04 -e GIT_SSL_NO_VERIFY=true
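To make the comparison in this exchange concrete, here is a minimal sketch (not part of clearml-agent) that pulls the CUDA version out of a docker image string and out of a pip requirement's `+cuXXX` suffix, then flags a mismatch. The helper names are hypothetical, and the parsing assumes the `nvidia/cuda:<major>.<minor>-...` tag layout and the `+cu101`-style suffix seen in this thread.

```python
import re
from typing import Optional


def cuda_from_image(image: str) -> Optional[str]:
    """Return the CUDA version digits from an image string, e.g. '102'.

    Assumes a tag of the form 'nvidia/cuda:<major>.<minor>-...'.
    """
    match = re.search(r"cuda:(\d+)\.(\d+)", image)
    return f"{match.group(1)}{match.group(2)}" if match else None


def cuda_from_requirement(requirement: str) -> Optional[str]:
    """Return the CUDA digits from a '+cuXXX' suffix, or None if absent."""
    match = re.search(r"\+cu(\d+)", requirement)
    return match.group(1) if match else None


def is_cuda_mismatch(image: str, requirement: str) -> bool:
    """True only when both sides state a CUDA version and they differ."""
    image_cuda = cuda_from_image(image)
    package_cuda = cuda_from_requirement(requirement)
    if image_cuda is None or package_cuda is None:
        return False  # nothing to compare, so no mismatch can be claimed
    return image_cuda != package_cuda


if __name__ == "__main__":
    image = "nvidia/cuda:10.2-devel-ubuntu18.04 -e GIT_SSL_NO_VERIFY=true"
    print(is_cuda_mismatch(image, "torch==1.6.0+cu101"))
```

A requirement without a `+cuXXX` suffix (such as `torch == 1.6.0`) carries no CUDA information, so the sketch reports no mismatch for it rather than guessing.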
AgitatedDove14, would you elaborate on this resolution process?
SubstantialElk6, could you post the "Installed packages" section under Execution of this specific Task?
Hi AgitatedDove14, I changed everything to CUDA 10.1 and tried again, with the same error. The section is as follows. I made sure torch==1.6.0+cu101 and torchvision==0.8.2+cu101 are in the PyPI repo, but the same error still came up.
` # Python 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
boto3 == 1.14.56
clearml == 0.17.4
numpy == 1.19.1
torch == 1.6.0
torchvision == 0.7.0
Detailed import analysis
**************************
IMPORT PACKAGE boto3
clearml.storage: 0
IMPORT PACKAGE clearml
pytorch_mnist.py: 14
IMPORT PACKAGE numpy
pytorch_mnist.py: 13
IMPORT PACKAGE torch
pytorch_mnist.py: 8,9,10,11
IMPORT PACKAGE torchvision
pytorch_mnist.py: 12 `
I can't seem to find a fix for this. I ended up using an image that comes with torch installed.
Hi SubstantialElk6
1. clearml-agent was just updated; it should solve the issue.
2. Notice that the "torch" / "torchvision" packages are resolved by the agent based on the PyTorch compatibility table. Is there a way to reproduce the issue where it fails to resolve the torch version? Could you send a full log?
3. If you want a specific torch version, you can put a direct link to the torch wheel, for example: https://download.pytorch.org/whl/cu102/torch-1.6.0-cp37-cp37m-linux_x86_64.whl
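When pinning a direct wheel link like this, it can help to check that the wheel's Python tag matches the interpreter the agent will run. The example link is a cp37 build, while the "Installed packages" section earlier in the thread reports Python 3.6.9. The following is a rough sketch, assuming the standard wheel filename layout (`name-version[-build]-python-abi-platform.whl`); the function names are hypothetical and it only handles CPython `cpXY` tags.

```python
import sys
from typing import Dict, Sequence
from urllib.parse import urlparse


def parse_wheel_url(url: str) -> Dict[str, str]:
    """Split a wheel URL's filename into its name, version, and tags."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # Five fields, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def matches_interpreter(python_tag: str,
                        version_info: Sequence[int] = sys.version_info) -> bool:
    """True when a CPython tag such as 'cp37' matches (major, minor)."""
    return python_tag == f"cp{version_info[0]}{version_info[1]}"


if __name__ == "__main__":
    url = ("https://download.pytorch.org/whl/cu102/"
           "torch-1.6.0-cp37-cp37m-linux_x86_64.whl")
    info = parse_wheel_url(url)
    print(info["python_tag"], matches_interpreter(info["python_tag"]))
```

Run inside the same image the agent uses, this indicates whether the pinned wheel could install there at all, before a full Task run is spent finding out.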
Hi AgitatedDove14, what version should I change it to? I'm currently on v0.17.2rc3.
SubstantialElk6, could you try with the latest (just released)?
pip install clearml-agent==0.17.2
Then if possible, could you attach the full log of the agent's execution (Task->results->Console)