Hey guys! I'm having some issues with PyTorch and ClearML. I am starting a new task using `Task.create` and setting PyTorch as a requirement under `packages`. For some reason PyTorch with CUDA 12 is being installed, but I need CUDA 11. Do you know how to set it to install CUDA 11?

  
  
Posted one month ago

Answers 41


Hi @<1734020162731905024:profile|RattyBluewhale45> , what version of PyTorch are you specifying?

  
  
Posted one month ago

Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂

  
  
Posted one month ago

Thank you for getting back to me

  
  
Posted one month ago

I can install on the server with this command

  
  
Posted one month ago

pip install --pre torchvision --force-reinstall --index-url None
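(The index URL in the command above didn't survive; for illustration only, a command of the same shape pointed at a CUDA 11.8 nightly wheel index might look like the line below. The cu118 URL is an assumption, not the original link.)

pip install --pre torchvision --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118  # index URL is illustrative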

  
  
Posted one month ago

I am trying `Task.create` like so:

from clearml import Task

# Create a draft task from a local script, with torch as the only requirement
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)
  
  
Posted one month ago

It seems to find a CUDA 11 index, then it installs CUDA 12:

Torch CUDA 111 index page found, adding ``
PyTorch: Adding index `` and installing `torch ==2.4.0.*`
Looking in indexes: , ,
Collecting torch==2.4.0.*
  Using cached torch-2.4.0-cp310-cp310-manylinux1_x86_64.whl (797.2 MB)
2024-08-12 12:40:37
Collecting clearml
  Using cached clearml-1.16.3-py2.py3-none-any.whl (1.2 MB)
Collecting triton==3.0.0
  Using cached  (209.4 MB)
2024-08-12 12:40:42
Collecting nvidia-nccl-cu12==2.20.5
  Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Collecting nvidia-curand-cu12==10.3.2.106
  
  
Posted one month ago

I think it tries to get the latest one. Are you using the agent in docker mode? You can also control this via `clearml.conf` with `agent.cuda_version`.
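In clearml.conf that would be a small sketch like this (the 11.2 value is only an example; any CUDA 11.x works the same way):

agent {
    # torch wheels will be resolved against this CUDA version
    cuda_version: "11.2"
}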

  
  
Posted one month ago

Thank you

  
  
Posted one month ago

I have set `agent { cuda_version: 11.2 }`

  
  
Posted one month ago

In the config file it should be something like this, I think: `agent.cuda_version="11.2"`

  
  
Posted one month ago

docker="nvidia/cuda:11.8.0-base-ubuntu20.04"

  
  
Posted one month ago

agent.cuda_version="11.2"

  
  
Posted one month ago

I am running the agent with `clearml-agent daemon --queue training`

  
  
Posted one month ago

I suggest running it in docker mode with a docker image that already has CUDA installed.
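For example (a sketch; the image tag is the CUDA 11.8 one mentioned later in this thread):

clearml-agent daemon --queue training --docker nvidia/cuda:11.8.0-base-ubuntu20.04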

  
  
Posted one month ago

Thank you

  
  
Posted one month ago

@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using `Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")`

  
  
Posted one month ago

It's hanging at:


Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3.8 filelock-3.15.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.23.0 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.2 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.35.1 requests-2.31.0 rpds-py-0.20.0 six-1.16.0 urllib3-1.26.19 virtualenv-20.26.3 zipp-3.20.0
WARNING: You are using pip version 20.1.1; however, version 24.2 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
  
  
Posted one month ago

@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?

  
  
Posted one month ago

Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.0.2
    Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
    Can't uninstall 'pip'. No files were found to uninstall.

  
  
Posted one month ago

I have set `agent.package_manager.pip_version=""`, which resolved that message.
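The full key in clearml.conf would look something like this sketch (an empty value removes the agent's default pip<20.2 pin seen in the log above):

agent {
    package_manager {
        # leave empty to keep whatever pip the environment already has
        pip_version: ""
    }
}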

  
  
Posted one month ago

But the process is still hanging, not proceeding to actually run the ClearML task.

  
  
Posted one month ago

To achieve running both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!

  
  
Posted one month ago

Solved that by setting `docker_args=["--privileged", "--network=host"]`
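Combined, the task creation from earlier in the thread would then look something like this sketch (the image and docker_args are the ones reported to work above):

from clearml import Task

task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
    docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
    docker_args=["--privileged", "--network=host"],
)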

  
  
Posted one month ago

@<1523701070390366208:profile|CostlyOstrich36> same error now 😞

Environment setup completed successfully
Starting Task Execution:
/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL:  Alternatively, go to:  to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
Traceback (most recent call last):
  File "facility_classifier/test_gpu.py", line 8, in <module>
    assert torch.cuda.is_available()
AssertionError
  
  
Posted one month ago

It means that there is an issue with the drivers. I suggest trying this docker image: nvcr.io/nvidia/pytorch:23.04-py3

  
  
Posted one month ago

Thank you I will try that

  
  
Posted one month ago

Isn't the problem that CUDA 12 is being installed?

  
  
Posted one month ago

CUDA support comes from the driver itself. The agent doesn't install CUDA; it installs a compatible torch build, assuming that CUDA is properly installed on the machine.
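A quick way to see what the driver actually supports is the nvidia-smi header; the "CUDA Version" field there is the highest CUDA runtime the installed driver can serve:

nvidia-smi  # "CUDA Version" in the header = max CUDA the driver supports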

  
  
Posted one month ago

Just to make sure, run the code on the machine itself to verify that Python can actually detect the driver.
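For example, a minimal check along those lines (a sketch; run it directly on the machine, outside the agent):

import torch

# Which torch build is installed, and which CUDA it was compiled against
print(torch.__version__)
print(torch.version.cuda)
# True only if the installed driver can serve that CUDA version
print(torch.cuda.is_available())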

  
  
Posted one month ago