@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
But the process is still hanging and never gets as far as actually running the ClearML task
I suggest running it in docker mode with a docker image that already has cuda installed
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
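For anyone hitting the same error: it compares the host driver version against the minimum the container was built for. A minimal sketch of that comparison, using the two version strings from the message above. The helper names are made up for illustration, and the `nvidia-smi` query assumes the tool is on the PATH (the function returns None otherwise).

```python
import shutil
import subprocess


def parse_version(text):
    """Turn a dotted version such as '460.32.03' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, required):
    """Return True if the installed driver version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()


if __name__ == "__main__":
    # Versions taken from the error message in this thread.
    print(driver_is_new_enough("460.32.03", "530.30"))
    print("Driver on this machine:", installed_driver_version())
```

If the check fails, the options discussed in this thread are an older image that matches the host driver, or a driver upgrade on the host.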
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision, with cu118
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?
This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)
In the config file it should be something like this: agent.cuda_version="11.2" I think
It seems to find a cuda 11, then it installs cuda 12
Torch CUDA 111 index page found, adding `
`
PyTorch: Adding index `
` and installing `torch ==2.4.0.*`
Looking in indexes:
,
,
Collecting torch==2.4.0.*
Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
Using cached
(209.4 MB)
2024-08-12 12:40:42
Collecting nvidia-nccl-cu12==2.20.5
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
CUDA is the driver itself. The agent doesn't install CUDA but installs a compatible torch assuming that CUDA is properly installed.
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
I am trying task.create like so:
task = Task.create(
script="test_gpu.py",
packages=["torch"],
)
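A rough sketch of how the same call could look with the docker image suggested in this thread passed through the `docker` argument of `Task.create`. The project name, task name and queue name are placeholders, not values from the thread, and it assumes `clearml` is installed and configured on the machine that creates the task.

```python
def build_task_kwargs(script, docker_image, packages,
                      project_name="examples", task_name="gpu check"):
    """Collect the Task.create arguments in one place.

    project_name and task_name are hypothetical defaults for illustration.
    """
    return {
        "project_name": project_name,
        "task_name": task_name,
        "script": script,
        "packages": list(packages),
        "docker": docker_image,
    }


def create_and_enqueue(queue_name="default"):
    """Create the task and send it to an agent queue (queue name is a placeholder)."""
    # Imported here so the helper above can be used without clearml installed.
    from clearml import Task

    kwargs = build_task_kwargs(
        script="test_gpu.py",
        docker_image="nvcr.io/nvidia/pytorch:23.04-py3",
        packages=["torch"],
    )
    task = Task.create(**kwargs)
    Task.enqueue(task, queue_name=queue_name)
    return task


if __name__ == "__main__":
    # Only prints the arguments; call create_and_enqueue() to actually submit.
    print(build_task_kwargs("test_gpu.py", "nvcr.io/nvidia/pytorch:23.04-py3", ["torch"]))
```

The image only takes effect when the agent that picks up the task is running in docker mode.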
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
I can install the correct torch version with this command: pip install --pre torchvision --force-reinstall --index-url None
Just to make sure, run the code on the machine itself to verify that python can actually detect the driver
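Something along these lines could serve as that check. It is a small sketch that reports what PyTorch sees and does not fail if torch is missing; the function name and the report keys are made up for illustration.

```python
def cuda_report():
    """Collect what PyTorch can see on this machine."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_for_cuda": None,
        "cuda_available": False,
        "device_count": 0,
    }
    try:
        import torch
    except Exception:
        # torch is missing or fails to import; return the empty report.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    report["device_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back False while `built_for_cuda` is set, the torch build expects CUDA but cannot use the driver, which points at the driver or container rather than at the agent.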
Isn't the problem that CUDA 12 is being installed?
I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version
pip install --pre torchvision --force-reinstall --index-url
None
OK, then just try the docker image I suggested 🙂