If I run nvidia-smi it returns valid output, and it says the CUDA version is 11.2.
It's hanging at:
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3.8 filelock-3.15.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.2 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.35.1 requests-2.31.0 rpds-py-0.20.0 six-1.16.0 urllib3-1.26.19 virtualenv-20.26.3 zipp-3.20.0
WARNING: You are using pip version 20.1.1; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
In the config file it should be something like this, I think: agent.cuda_version="11.2"
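For reference, that key lives in the agent section of clearml.conf; a minimal sketch (only the cuda_version value is from this thread):
```
agent {
    # force the CUDA version the agent uses when resolving torch wheels
    cuda_version: "11.2"
}
```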
I am running the agent with clearml-agent daemon --queue training
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
It seems to find CUDA 11, but then it installs packages built for CUDA 12.
Torch CUDA 111 index page found, adding index `[URL stripped]`
PyTorch: Adding index `[URL stripped]` and installing `torch==2.4.0.*`
Looking in indexes: [index URLs stripped]
Collecting torch==2.4.0.*
  Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
Collecting clearml
  Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
  Using cached [wheel URL stripped] (209.4 MB)
Collecting nvidia-nccl-cu12==2.20.5
  Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
I suggest running it in docker mode with a docker image that already has CUDA installed.
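For example, something like this (a sketch: the queue name is from earlier in the thread, the image is the one suggested below):
```
clearml-agent daemon --queue training --docker nvidia/cuda:11.8.0-base-ubuntu20.04
```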
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
I can install on the server with this command
I have set agent.package_manager.pip_version="", which resolved that message.
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL: [URL stripped] Alternatively, go to: [URL stripped] to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
File "facility_classifier/test_gpu.py", line 8, in <module>
assert torch.cuda.is_available()
AssertionError
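For context, test_gpu.py is essentially just a CUDA smoke test. A reconstruction from the traceback above (only the assert on line 8 is confirmed; the prints are assumed):
```
# test_gpu.py: sketch reconstructed from the traceback; only the assert is confirmed
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())  # the "False" printed above

assert torch.cuda.is_available()  # line 8 in the traceback
```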
Solved that by setting docker_args=["--privileged", "--network=host"]
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision, with cu118.
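One way that might work (a sketch; the version pins and the project/task names are illustrative assumptions, not from the thread) is to declare the requirements before Task.init:
```
from clearml import Task

# Pin cu118 builds explicitly so the agent installs these
# instead of resolving its own torch. Version strings are assumptions.
Task.add_requirements("torch", "==2.4.0+cu118")
Task.add_requirements("torchvision", "==0.19.0+cu118")

task = Task.init(project_name="facility_classifier", task_name="test_gpu")
```
The agent would likely also need agent.package_manager.extra_index_url pointed at the cu118 wheel index so pip can find those builds.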
This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
CUDA comes with the driver itself. The agent doesn't install CUDA; it installs a compatible torch, assuming CUDA is properly installed on the machine.
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?
Just try it as-is first with this docker image, and verify that the code can access the CUDA driver independently of the agent.
Just to make sure, run the code on the machine itself to verify that Python can actually detect the driver.
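For example, straight on the host (assuming torch is installed there):
```
python3 -c "import torch; print(torch.cuda.is_available())"
```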
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url [URL stripped]`