It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3
Just to make sure, run the code on the machine itself to verify that python can actually detect the driver
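A minimal sketch of that check, assuming `nvidia-smi` is on the PATH of the host being tested. The helper names `detect_nvidia_driver` and `torch_sees_cuda` are hypothetical, not part of ClearML or PyTorch; both return `None` when the tool or library is missing instead of raising.

```python
import shutil
import subprocess
from typing import Optional


def detect_nvidia_driver() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # the tool exists but could not talk to the driver
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def torch_sees_cuda() -> Optional[bool]:
    """Return torch.cuda.is_available(), or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("driver version:", detect_nvidia_driver())
    print("torch sees CUDA:", torch_sees_cuda())
```

Running this once directly on the machine and once inside the container should show whether the problem is the host driver or the image.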
This has been resolved now! Thank you for your help @<1523701070390366208:profile|CostlyOstrich36>
But the process is still hanging, and not proceeding to actually run the ClearML task
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
pip install --pre torchvision --force-reinstall --index-url
None
I am running the agent with clearml-agent daemon --queue training
docker="nvidia/cuda:11.8.0-base-ubuntu20.04"
ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
version 460.32.03 was detected and compatibility mode is UNAVAILABLE.
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
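The mismatch in this error can be checked by hand with a small comparison. This is only a sketch: the function names are hypothetical, and it compares dotted version strings numerically without modelling the "compatibility mode" the container message mentions.

```python
from typing import Tuple


def parse_driver_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted driver version such as '460.32.03' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_meets_minimum(detected: str, required: str) -> bool:
    """True if the detected driver is at least the required release."""
    return parse_driver_version(detected) >= parse_driver_version(required)


if __name__ == "__main__":
    # Values taken from the error message above.
    print(driver_meets_minimum("460.32.03", "530.30"))
```

With the values from the log, the detected driver (460.32.03) is older than the release the image was built for (530.30), so either the host driver needs an upgrade or an older image has to be used.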
This one seems to be compatible: [nvcr.io/nvidia/pytorch:22.04-py3](http://nvcr.io/nvidia/pytorch:22.04-py3)
OK, then just try the docker image I suggested 🙂
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
In the config file it should be something like this: agent.cuda_version="11.2" I think
I suggest running it in docker mode with a docker image that already has cuda installed
@<1523701070390366208:profile|CostlyOstrich36> same error now 😞
Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:
Alternatively, go to:
to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
File "facility_classifier/test_gpu.py", line 8, in <module>
assert torch.cuda.is_available()
AssertionError
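The "found version 11020" in the warning appears to be the CUDA version the driver supports, in CUDA's integer encoding (1000 * major + 10 * minor). A short sketch for decoding it, assuming that encoding; the function names are hypothetical:

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Decode CUDA's integer version (1000 * major + 10 * minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports(encoded_driver_cuda: int, wheel_cuda: Tuple[int, int]) -> bool:
    """True if the driver's CUDA version is at least the one the wheel was built for."""
    return decode_cuda_version(encoded_driver_cuda) >= wheel_cuda


if __name__ == "__main__":
    print(decode_cuda_version(11020))        # driver-side CUDA version
    print(driver_supports(11020, (11, 8)))   # a cu118 build of PyTorch
```

Read this way, the driver supports CUDA 11.2, which is older than a cu118 (CUDA 11.8) build of PyTorch, and that would explain why `torch.cuda.is_available()` returns False.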
I have set agent.package_manager.pip_version=""
which resolved that message
What I don't understand is how to tell ClearML to install this version of PyTorch and torchvision, with cu118
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
I am trying Task.create like so:
from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
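For reference, here is a sketch of the same call combined with the suggestions made elsewhere in this thread: a CUDA-enabled image that was reported as compatible, plus `--network=host` passed through `docker_args`. The project and task names are hypothetical placeholders, and whether this image fits a given host driver is an assumption that still has to be checked.

```python
from typing import Any, Dict


def build_task_kwargs() -> Dict[str, Any]:
    """Collect the Task.create arguments discussed in this thread."""
    return {
        "project_name": "examples",   # hypothetical project name
        "task_name": "test_gpu",      # hypothetical task name
        "script": "test_gpu.py",
        "packages": ["torch"],
        # Image suggested in the thread as compatible with the older driver.
        "docker": "nvcr.io/nvidia/pytorch:22.04-py3",
        # Extra docker run arguments, as in the --network=host fix.
        "docker_args": "--network=host",
    }


def create_task():
    """Create the task; requires the clearml package and a configured server."""
    from clearml import Task  # imported lazily so the sketch loads without clearml

    return Task.create(**build_task_kwargs())
```

Calling `create_task()` only registers the task; it still has to be enqueued to the queue the agent is listening on (here, `training`).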
I can install the correct torch version with this command: `pip install --pre torchvision --force-reinstall --index-url None`
Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of pytorch are you specifying?