I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via `clearml.conf` with `agent.cuda_version`.
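For reference, a minimal sketch of what that could look like in `clearml.conf` on the agent machine (values are examples; the exact format may be `11.8` or `118`, and the same can usually be set with the `CUDA_VERSION` environment variable):

```
agent {
    # Force the CUDA version the agent assumes when resolving torch wheels,
    # instead of auto-detecting / defaulting to the latest.
    cuda_version: "11.8"

    # Optionally pin cuDNN as well (example value).
    cudnn_version: "8.6"
}
```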
Isn't the problem that CUDA 12 is being installed?
This one seems to be compatible: `nvcr.io/nvidia/pytorch:22.04-py3`
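If the agent runs in docker mode, one way to point the task at that image is the `docker` argument of `Task.create` (just a sketch; project/task names are made up):

```python
from clearml import Task

# Sketch: run the task inside the NGC PyTorch image, which already ships a
# CUDA/torch combination that matches the host driver.
task = Task.create(
    project_name="gpu-tests",   # example name
    task_name="test_gpu",       # example name
    script="test_gpu.py",
    docker="nvcr.io/nvidia/pytorch:22.04-py3",
)
```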
What I don't understand is how to tell ClearML to install this version of torch and torchvision with cu118.
It seems to find CUDA 11, but then it installs CUDA 12 packages:
Torch CUDA 111 index page found, adding `<index url stripped>`
PyTorch: Adding index `<index url stripped>` and installing `torch ==2.4.0.*`
Looking in indexes: <index urls stripped>
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached <wheel url stripped> (209.4 MB)
2024-08-12 12:40:42
Collecting nvidia-nccl-cu12==2.20.5
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
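For what it's worth, a sketch of how that call could pin the cu118 builds explicitly (versions are examples; the agent also needs the PyTorch cu118 wheel index available, e.g. via `agent.package_manager.extra_index_url` in `clearml.conf`):

```python
from clearml import Task

# Sketch: pin CUDA 11.8 builds of torch/torchvision (example versions),
# so the agent does not resolve the default cu12 wheels.
task = Task.create(
    script="test_gpu.py",
    packages=[
        "torch==2.4.0+cu118",
        "torchvision==0.19.0+cu118",
    ],
)
```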
If I run nvidia-smi it returns valid output and it says the CUDA version is 11.2
Solved that by setting `docker_args=["--privileged", "--network=host"]`
Just to make sure, run the code on the machine itself to verify that Python can actually detect the driver.
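For example, a minimal check you could run directly on that machine (assuming torch is installed there):

```python
import torch

# Quick sanity check that the driver / CUDA runtime is visible to PyTorch.
print("torch version:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```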