This has been resolved now! Thank you for your help CostlyOstrich36
But the process is still hanging and not proceeding to actually run the ClearML task
Just try as-is first with this docker image, and verify that the code can access the CUDA driver independently of the agent
If I run `nvidia-smi` it returns valid output, and it reports CUDA version 11.2
CUDA here refers to the driver itself. The agent doesn't install CUDA; it installs a compatible torch, assuming the CUDA driver is already properly installed.
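The compatibility the agent relies on can be illustrated with a small sketch (a hypothetical helper, not part of ClearML): as a rule of thumb, a driver reporting CUDA X.Y can run runtimes built for X.Z with Z ≤ Y, so a cu111 torch wheel works on a driver reporting 11.2, while a cu12 wheel does not.

```python
# Hypothetical helper, NOT part of ClearML: illustrates the rule of thumb
# that a wheel's bundled CUDA runtime must not exceed the driver's reported
# CUDA version (within CUDA 11.x, minor-version compatibility relaxes this).
def torch_wheel_compatible(driver_cuda: str, wheel_cuda: str) -> bool:
    """Rough check: compare (major, minor) of wheel runtime vs. driver."""
    d_major, d_minor = (int(x) for x in driver_cuda.split("."))
    w_major, w_minor = (int(x) for x in wheel_cuda.split("."))
    return (w_major, w_minor) <= (d_major, d_minor)

# A cu111 wheel on a driver reporting CUDA 11.2: fine.
print(torch_wheel_compatible("11.2", "11.1"))  # True
# A cu12 wheel on the same driver: mismatch, torch.cuda fails at runtime.
print(torch_wheel_compatible("11.2", "12.1"))  # False
```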
CostlyOstrich36 I'm now running the agent with `--docker`, and I'm using `task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")`
It seems to detect CUDA 11, but then it installs CUDA 12 packages:
Torch CUDA 111 index page found, adding ``
PyTorch: Adding index `` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached (209.4 MB)
2024-08-12 12:40:42
Collecting nvidia-nccl-cu12==2.20.5
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
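The `nvidia-*-cu12` packages being collected here are the CUDA 12 runtime dependencies of the default torch 2.4.0 wheel, which won't match a driver reporting CUDA 11.2. One possible workaround is to tell the agent which CUDA version to assume, or to point pip at a CUDA 11 wheel index, in `clearml.conf` — a sketch, assuming the standard agent configuration keys (please verify the exact key names against your agent version's default `clearml.conf`):

```
agent {
    # Force the CUDA version the agent assumes when resolving torch wheels
    # (here matching what nvidia-smi reports inside the container)
    cuda_version: "11.2"

    package_manager {
        # Optionally also point pip at a CUDA 11 PyTorch wheel index
        extra_index_url: ["https://download.pytorch.org/whl/cu111"]
    }
}
```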