Answered
Hey guys! I'm having some issues with PyTorch and ClearML. I am starting a new task using Task.create and setting PyTorch as a requirement under `packages`. For some reason PyTorch with CUDA 12 is being installed, but I need CUDA 11. Do you know how to set it to install CUDA 11?

  
  
Posted 5 months ago

Answers 41


ERROR: This container was built for NVIDIA Driver Release 530.30 or later, but
       version 460.32.03 was detected and compatibility mode is UNAVAILABLE.

       [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
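The message above boils down to a version comparison: the container expects driver 530.30 or later, while the host reports 460.32.03. A minimal sketch of that check, assuming plain dot-separated numeric versions (the helper names `parse_version` and `driver_satisfies` are made up for illustration and are not part of any NVIDIA or ClearML tool):

```python
def parse_version(text):
    """Turn a dotted version such as '460.32.03' into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(detected, required):
    """Return True when the detected driver is at least the required one."""
    d, r = parse_version(detected), parse_version(required)
    # Pad the shorter tuple with zeros so '530.30' compares as (530, 30, 0).
    width = max(len(d), len(r))
    d += (0,) * (width - len(d))
    r += (0,) * (width - len(r))
    return d >= r


# Values taken from the error message in this post.
print(driver_satisfies("460.32.03", "530.30"))  # False
```

So the image needs a newer host driver than the one installed, which is why an image built for an older driver release is the thing to look for.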
  
  
Posted 5 months ago

I can install the correct torch version with this command:
pip install --pre torchvision --force-reinstall --index-url None

  
  
Posted 5 months ago

pip install --pre torchvision --force-reinstall --index-url None

  
  
Posted 5 months ago

But the process still hangs and never gets as far as actually running the ClearML task.

  
  
Posted 5 months ago

I suggest running it in docker mode with a Docker image that already has CUDA installed.

  
  
Posted 5 months ago

CostlyOstrich36 do you have any ideas?

  
  
Posted 5 months ago

agent.cuda_version="11.2"

  
  
Posted 5 months ago

within a docker

  
  
Posted 5 months ago

It seems to find a CUDA 11 index, but then it installs CUDA 12 packages:


Torch CUDA 111 index page found, adding `
`
PyTorch: Adding index `
` and installing `torch ==2.4.0.*`
Looking in indexes: 
, 
, 

Collecting torch==2.4.0.*
  Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
  Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
  Using cached 
 (209.4 MB)
2024-08-12 12:40:42
Collecting nvidia-nccl-cu12==2.20.5
  Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
  
  
Posted 5 months ago

I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via clearml.conf with agent.cuda_version.

  
  
Posted 5 months ago

It means that there is an issue with the drivers. I suggest trying this docker image - nvcr.io/nvidia/pytorch:23.04-py3

  
  
Posted 5 months ago

To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!

  
  
Posted 5 months ago

Hi CostlyOstrich36, I am not specifying a version 🙂

  
  
Posted 5 months ago

I can install on the server with this command

  
  
Posted 5 months ago

CostlyOstrich36 same error now 😞

Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL: 
 Alternatively, go to: 
 to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
  File "facility_classifier/test_gpu.py", line 8, in <module>
    assert torch.cuda.is_available()
AssertionError
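The "found version 11020" in the warning is the CUDA driver API version packed into one integer. A small sketch of how to read it, assuming the usual CUDA encoding of 1000 * major + 10 * minor (the function name `decode_cuda_version` is made up for illustration):

```python
def decode_cuda_version(encoded):
    """Split a packed CUDA version (1000 * major + 10 * minor) into (major, minor)."""
    major, remainder = divmod(encoded, 1000)
    return major, remainder // 10


# Value taken from the warning in this post.
print(decode_cuda_version(11020))  # (11, 2)
```

In other words the driver supports up to CUDA 11.2, and the installed torch build appears to expect something newer.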
  
  
Posted 5 months ago

Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.

  
  
Posted 5 months ago

In the config file it should be something like this, I think: agent.cuda_version="11.2"

  
  
Posted 5 months ago

Thank you I will try that

  
  
Posted 5 months ago

I have set agent.package_manager.pip_version="" which resolved that message

  
  
Posted 5 months ago

OK, then just try the docker image I suggested 🙂

  
  
Posted 5 months ago

If I run nvidia-smi it returns valid output and it says the CUDA version is 11.2
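For anyone who wants to capture this from a script, here is a rough sketch that asks nvidia-smi for the driver version. It assumes nvidia-smi is on the PATH and returns None otherwise; the function name `query_driver_versions` is made up for illustration.

```python
import shutil
import subprocess


def query_driver_versions():
    """Return one driver version string per GPU, or None if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    # One line per GPU; drop empty lines.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(query_driver_versions())
```

Note that the "CUDA Version" shown in the nvidia-smi banner is the highest CUDA version the driver supports, not a toolkit installed on the machine.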

  
  
Posted 5 months ago

Just to make sure, run the code on the machine itself to verify that python can actually detect the driver
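A minimal sketch of such a check, run directly on the machine without the agent. It only uses standard torch attributes and falls back gracefully if torch is not installed; the function name `cuda_report` is made up for illustration.

```python
def cuda_report():
    """Collect basic facts about the local torch install and its view of CUDA."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this torch build was compiled against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # False when the driver is missing or too old for this build.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `cuda_available` is False here as well, the problem is between the torch build and the driver rather than in the agent.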

  
  
Posted 5 months ago

Thank you for getting back to me

  
  
Posted 5 months ago

unrelated to the agent itself

  
  
Posted 5 months ago

I am trying Task.create like so:

from clearml import Task

# No torch version is pinned here, so the agent picks one itself.
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
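One thing that might be worth trying is pinning the torch requirement explicitly instead of leaving it open. The sketch below only builds the keyword arguments, so it can be inspected without a ClearML server. The pin `torch==2.0.1` is a placeholder assumption, not a version confirmed to work with this driver, and the helper names `build_task_kwargs` and `create_task` are made up. The Docker image is the one mentioned elsewhere in this thread.

```python
def build_task_kwargs(script, torch_requirement, docker_image=None):
    """Assemble keyword arguments for Task.create with a pinned torch requirement."""
    kwargs = {"script": script, "packages": [torch_requirement]}
    if docker_image:
        # Only relevant when the agent runs in docker mode.
        kwargs["docker"] = docker_image
    return kwargs


def create_task(**kwargs):
    """Create the task; clearml is imported lazily so the helper above works without it."""
    from clearml import Task

    return Task.create(**kwargs)


kwargs = build_task_kwargs(
    "test_gpu.py",
    "torch==2.0.1",  # placeholder pin, an assumption for illustration
    docker_image="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
)
print(kwargs)
# create_task(**kwargs)  # uncomment on a machine with clearml configured
```

Whether the agent then resolves a CUDA 11 build still depends on agent.cuda_version and on which wheels exist for the pinned release.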
  
  
Posted 5 months ago

Thank you

  
  
Posted 5 months ago

This one seems to be compatible: nvcr.io/nvidia/pytorch:22.04-py3

  
  
Posted 5 months ago

Hi RattyBluewhale45, what version of PyTorch are you specifying?

  
  
Posted 5 months ago

CostlyOstrich36 I'm now running the agent with --docker, and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")

  
  
Posted 5 months ago

Just try it as is first with this Docker image, and verify that the code can access the CUDA driver independently of the agent.

  
  
Posted 5 months ago