Hi CostlyOstrich36 I am not specifying a version 🙂
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version
pip install --pre torchvision --force-reinstall --index-url
None
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
This has been resolved now! Thank you for your help CostlyOstrich36
Isn't the problem that CUDA 12 is being installed?
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url None`
OK, then just try the docker image I suggested 🙂
Hi RattyBluewhale45, what version of pytorch are you specifying?
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
CostlyOstrich36 same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to:
to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
File "facility_classifier/test_gpu.py", line 8, in <module>
assert torch.cuda.is_available()
AssertionError
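For anyone reading that warning later: the `11020` it reports looks like CUDA's integer version encoding (1000 × major + 10 × minor), which would make it CUDA 11.2. A minimal sketch of decoding it, assuming that encoding; `decode_cuda_version` is a made-up helper name, not something from torch or clearml:

```python
def decode_cuda_version(encoded: int) -> str:
    """Decode CUDA's integer version encoding (1000 * major + 10 * minor)."""
    major, remainder = divmod(encoded, 1000)
    minor = remainder // 10
    return f"{major}.{minor}"


# The value taken from the UserWarning above.
print(decode_cuda_version(11020))  # 11.2
```

If that reading is right, the driver tops out at CUDA 11.2, so the torch build or docker image has to target something that driver can still serve.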
I have set agent.package_manager.pip_version=""
which resolved that message
Just to make sure, run the code on the machine itself to verify that python can actually detect the driver
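Something like this could serve as that check. It is only a sketch: `cuda_report` is a hypothetical helper name, and it assumes you run it with the same interpreter the task uses.

```python
import importlib.util


def cuda_report() -> dict:
    """Collect what PyTorch can see; also works when torch is not installed."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version the wheel was compiled against; None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back False while `built_for_cuda` is newer than what the driver supports, that points at a driver/wheel mismatch rather than a ClearML problem.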
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
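That error is a plain version comparison between the driver release the image wants and the one on the host. A rough sketch of the same check, using the two numbers from the message; `parse_release` is a hypothetical helper, not an NVIDIA or ClearML API:

```python
def parse_release(release: str) -> tuple:
    """Turn a driver release string such as '460.32.03' into a tuple of ints."""
    return tuple(int(part) for part in release.split("."))


required = parse_release("530.30")      # what the container was built for
detected = parse_release("460.32.03")   # what the host driver reports

# Tuples compare element by element, so 460 < 530 decides it.
print(detected >= required)  # False
```

So either the host driver needs upgrading, or the image needs to be an older one built for a driver release at or below 460.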
I suggest running it in docker mode with a docker image that already has cuda installed
It's hanging at
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3.8 filelock-3.15.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.2 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.35.1 requests-2.31.0 rpds-py-0.20.0 six-1.16.0 urllib3-1.26.19 virtualenv-20.26.3 zipp-3.20.0
WARNING: You are using pip version 20.1.1; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
CostlyOstrich36 I'm now running the agent with --docker, and I'm using task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
CostlyOstrich36 do you have any ideas?
It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
Solved that by setting docker_args=["--privileged", "--network=host"]
What I don't understand is how to tell clearml to install this version of pytorch and torchvision, with cu118
This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)