It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
Solved that by setting docker_args=["--privileged", "--network=host"]
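For reference, a minimal sketch of how that could be wired into the task (assuming the standard clearml Task.create signature; the project/task names are placeholders, the image is the one suggested in this thread):

```python
from clearml import Task

# Forward extra `docker run` arguments to the agent via docker_args
# (placeholder project/task names).
task = Task.create(
    project_name="examples",
    task_name="gpu-test",
    script="test_gpu.py",
    docker="nvcr.io/nvidia/pytorch:22.04-py3",
    docker_args=["--privileged", "--network=host"],
)
```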
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
This one seems to be compatible: nvcr.io/nvidia/pytorch:22.04-py3
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker
and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
I can install on the server with this command
If I run nvidia-smi, it returns valid output and says the CUDA version is 11.2
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to:
to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
File "facility_classifier/test_gpu.py", line 8, in <module>
assert torch.cuda.is_available()
AssertionError
But the process is still hanging and not proceeding to actually run the ClearML task
CUDA is part of the driver itself. The agent doesn't install CUDA; it installs a compatible torch, assuming that CUDA is properly installed on the machine.
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision, with cu118
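One approach that might work (a sketch, not verified here: pin the cu118 builds in the package list and point the agent's pip at the PyTorch cu118 wheel index; the exact versions below are placeholders):

```python
# Sketch: pin the +cu118 builds explicitly (versions are placeholders).
task = Task.create(
    script="test_gpu.py",
    packages=["torch==2.0.1+cu118", "torchvision==0.15.2+cu118"],
)
```

and on the agent machine, in clearml.conf:

```
agent {
    package_manager {
        # Assumption: the +cu118 wheels are resolved from the PyTorch index
        extra_index_url: ["https://download.pytorch.org/whl/cu118"]
    }
}
```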
I suggest running it in docker mode with a docker image that already has cuda installed
Just try as-is first with this docker image + verify that the code can access the CUDA driver, unrelated to the agent
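For example (a sketch; the queue name is a placeholder), the agent can be started in docker mode with that image as its default:

```bash
# Run the agent in docker mode with GPU access; tasks pulled from the
# "default" queue (placeholder name) execute inside the NVIDIA PyTorch
# container unless the task specifies its own image.
clearml-agent daemon --queue default --docker nvcr.io/nvidia/pytorch:22.04-py3 --gpus all
```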
Isn't the problem that CUDA 12 is being installed?
I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version
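A sketch of the relevant clearml.conf override (the value here is an assumption matching the 11.2 driver reported by nvidia-smi):

```
agent {
    # Force the CUDA version the agent assumes when resolving torch wheels,
    # instead of auto-detecting / defaulting to the latest
    cuda_version: 11.2
}
```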
Just to make sure, run the code on the machine itself to verify that python can actually detect the driver
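Something like this quick check, run directly on the GPU machine (outside the agent), should show whether torch can see the driver:

```python
# Sanity check, run on the machine itself rather than through the agent.
import torch

print("torch:", torch.__version__)            # installed torch build
print("built for CUDA:", torch.version.cuda)  # CUDA version torch was compiled against
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```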
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url None`
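(The index URL was stripped from the message above; assuming it was the PyTorch cu118 nightly wheel index, the command would look roughly like this:)

```bash
# Assumption: the stripped --index-url pointed at the PyTorch cu118 nightly index.
pip install --pre torchvision --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118
```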