Just try as is first with this docker image + verify that the code can access cuda driver unrelated to the agent
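A quick way to verify that, e.g. from a python shell inside the container (this assumes torch is already installed in the image, which isn't guaranteed for the plain CUDA runtime image):

import torch  # assumes torch is present in the image

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("cuda version torch was built against:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))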
Isn't the problem that CUDA 12 is being installed?
CostlyOstrich36 I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
If I run nvidia-smi it returns valid output and says the CUDA version is 11.2
But the process is still hanging and never proceeds to actually running the ClearML task
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
I am trying Task.create like so:

from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision, with cu118
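Something like this is what I have in mind, if that's even the right approach (a sketch only: the version numbers are placeholders, and I'm assuming the agent's extra index URL, agent.package_manager.extra_index_url, would have to point at the cu118 wheel index for the +cu118 pins to resolve):

from clearml import Task

# sketch: pin the cu118 builds explicitly; versions below are placeholders,
# and the cu118 wheel index is assumed to be configured on the agent side
task = Task.create(
    script="test_gpu.py",
    packages=[
        "torch==2.0.1+cu118",
        "torchvision==0.15.2+cu118",
    ],
)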
It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
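On the task side that would just mean pointing the container at it, e.g. (reusing your test_gpu.py script):

from clearml import Task

# same task, but run inside the NGC PyTorch image, which ships its own torch + CUDA stack
task = Task.create(
    script="test_gpu.py",
    docker="nvcr.io/nvidia/pytorch:23.04-py3",
)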
I have set agent.package_manager.pip_version="", which resolved that message
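i.e. in the agent's clearml.conf it now looks roughly like this:

agent {
    package_manager {
        # empty string = don't force a specific pip version inside the task environment
        pip_version: ""
    }
}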
CUDA support comes from the driver itself. The agent doesn't install CUDA; it installs a compatible torch, assuming CUDA is already properly installed.