@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
File "facility_classifier/test_gpu.py", line 8, in <module>
assert torch.cuda.is_available()
AssertionError
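(Aside: a slightly fuller check than the bare assert in test_gpu.py makes mismatches like this easier to read; the script path is from the traceback above, the extra prints are just an illustrative sketch.)

import torch

# Print the versions involved before asserting, so a driver/toolkit
# mismatch like the one above is visible at a glance.
print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)   # e.g. 11.8 vs. the 11.2-capable driver found here
print("cuda available:", torch.cuda.is_available())
assert torch.cuda.is_available()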
Isn't the problem that CUDA 12 is being installed?
CUDA is the driver itself. The agent doesn't install CUDA; it installs a compatible torch, assuming that CUDA is properly installed.
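(A quick way to see what the agent will detect — not from the thread, just a standard check — is to read the driver version off the host:)

nvidia-smi                                                   # header shows the driver version and the highest CUDA version it supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader  # driver version only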
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
This one seems to be compatible: nvcr.io/nvidia/pytorch:22.04-py3
I am running the agent with clearml-agent daemon --queue training
OK, then just try the docker image I suggested 🙂
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
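(For reference, a minimal sketch of that setup — the project and script names are taken from the traceback above, everything else is assumed:)

from clearml import Task

task = Task.create(
    project_name="facility_classifier",                      # assumed example project name
    task_name="test_gpu",
    script="facility_classifier/test_gpu.py",                # path from the traceback above
    docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",  # image the agent runs the task in
)
Task.enqueue(task, queue_name="training")                    # queue the daemon listens on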
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision with cu118.
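(One way to do that — a sketch of the general mechanism, not necessarily how it was resolved below, and the versions are only examples: pin the exact cu118 wheels on the task and point the agent at the cu118 wheel index.)

from clearml import Task

# Pin the exact cu118 builds before Task.init so the agent installs the same ones
# (versions below are examples, not from the thread):
Task.add_requirements("torch", "==2.0.1+cu118")
Task.add_requirements("torchvision", "==0.15.2+cu118")
task = Task.init(project_name="facility_classifier", task_name="train")

and on the agent side, in clearml.conf:

agent.package_manager.extra_index_url: ["https://download.pytorch.org/whl/cu118"]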
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
I can install it on the server with this command
I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version.
I suggest running it in docker mode with a docker image that already has cuda installed
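(i.e. something along these lines — queue name from earlier in the thread, image as suggested above:)

clearml-agent daemon --queue training --docker nvcr.io/nvidia/pytorch:22.04-py3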
In the config file it should be something like this, I think: agent.cuda_version="11.2"
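(In clearml.conf that would look roughly like this — the 11.2 matches the driver version reported in the warning above:)

agent {
    # force the agent to resolve packages against this CUDA version
    # instead of whatever it auto-detects
    cuda_version: "11.2"
}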
It's hanging at
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3.8 filelock-3.15.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.2 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.35.1 requests-2.31.0 rpds-py-0.20.0 six-1.16.0 urllib3-1.26.19 virtualenv-20.26.3 zipp-3.20.0
WARNING: You are using pip version 20.1.1; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
I have set agent.package_manager.pip_version="", which resolved that warning.
But the process is still hanging, and not proceeding to actually running the clearml task
To run both the agent and the deployment on the same machine, adding --network=host to the docker run arguments solved it!
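(One way to pass that flag to every container the agent starts is via clearml.conf:)

agent {
    # extra arguments appended to the docker run command for every task container
    extra_docker_arguments: ["--network=host"]
}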