Answered
Hi, I'm having an issue getting a clearml-agent machine with an RTX 3090 to train remotely because it can't install PyTorch.

Hi, I'm having an issue getting a clearml-agent machine with an RTX 3090 to train remotely because it can't install PyTorch. My local development environment (also with a 3090) has torch==1.12.1+cu113, which I installed with the command: pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html. When I execute my task remotely, it fails with the error "Exception when trying to resolve python wheel: Could not find pytorch wheel URL for: torch==1.12.1 with cuda 114 support". The system seems kind of opaque; I can't readily find what exactly is going on behind this error. Any ideas what's going wrong here? I've attached the full log too. Thanks!

  
  
Posted one year ago

Answers 15


For comparison, the contents of my requirements.txt file is:
-f torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1

  
  
Posted one year ago

I do keep both my local and remote instances updated, and at this time they're both running CUDA 11.4 according to nvidia-smi, with the exact same driver version (470.141.03). So it's not strictly a mismatch error, since both systems are identical. As for why I have torch cu113 installed locally, I believe torch for cu114 wasn't available when I checked. But since it works fine on my local machine, shouldn't it work on the remote machine too?

  
  
Posted one year ago

When installing locally, you told pip to look for packages at that page, but you don't tell that to the remote pip.
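In other words, the remote agent's pip never sees the extra index. A hedged sketch of what the requirements.txt would need, assuming the same wheel index used for the local install (note that pip's -f/--find-links flag takes the index URL, not a package spec):

```
# requirements.txt — point pip at the index that hosts the +cu113 builds
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1
```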

  
  
Posted one year ago

Also, in the log file, it does say:

Torch CUDA 113 download page found
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113, best candidate None

which indicates that it has found the page but can't find the right wheel. What's even more odd is that when I try to initiate the task from another dev machine with no GPU (torch==1.12.1), I get the following error, indicating that it found a wheel but couldn't install it:
Torch CUDA 113 download page found
Found PyTorch version torch==1.12.1 matching CUDA version 113
Torch CUDA 113 download page found
Found PyTorch version torchvision==0.13.1 matching CUDA version 113
Collecting torch==1.12.1+cu113
  Downloading (1837.7 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 GB 2.4 MB/s eta 0:00:00
Saved ./.clearml/pip-download-cache/cu114/torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.13.1+cu113
  Downloading (23.4 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.4/23.4 MB 51.6 MB/s eta 0:00:00
Saved ./.clearml/pip-download-cache/cu114/torchvision-0.13.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
clearml_agent: ERROR: Could not install task requirements!
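For context, the "not a supported wheel on this platform" error is purely about the compatibility tags encoded in the wheel filename: if none of them match a tag the installing pip supports, pip refuses the file, and an outdated pip may not recognize tags that are perfectly valid. A small illustration of where those tags live, using the filename from the log above:

```python
# Wheel filenames follow {dist}-{version}-{python}-{abi}-{platform}.whl;
# pip rejects the file when no tag combination matches ones it supports.
name = "torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl"
dist, version, py_tag, abi_tag, plat_tag = name[:-len(".whl")].split("-")
print(dist, version)               # torch 1.12.1+cu113
print(py_tag, abi_tag, plat_tag)   # cp310 cp310 linux_x86_64
```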

  
  
Posted one year ago

The log suggests there is no cu113 wheel available either:

Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113

  
  
Posted one year ago

This turned out to be a couple of issues, one with pip and one with ClearML. After upgrading to 1.4.0rc, ClearML was able to find and download the correct wheel, but pip failed to install it, claiming it wasn't supported on this platform. I found that by going into the clearml.conf file and removing the default configuration that constrains pip_version: "<20.2", the latest version of pip gets installed and doesn't throw that error. So I guess the takeaway is that there's a questionable default in clearml.conf that should probably be changed to use the latest version of pip rather than an older, buggy version...
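For anyone hitting the same thing: the constraint lives under the agent's package_manager section of clearml.conf. A sketch of the relevant fragment with the pin commented out (key names are from the default config shipped with clearml-agent; verify against your installed version):

```
# clearml.conf (agent side)
agent {
    package_manager {
        type: pip,
        # the default pins an old pip that misreads newer wheel tags;
        # comment it out (or raise it) so a current pip can install the wheel
        # pip_version: "<20.2",
    }
}
```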

  
  
Posted one year ago

Oh, does clearml automatically use the requirements.txt file when available?

  
  
Posted one year ago

With more experimenting, this is looking like a bug. I upgraded clearml-agent to 1.4.0rc and now it finds the wheel and downloads it, but then fails with the same error as above, saying the .whl file "is not a supported wheel on this platform". But why would this wheel not be supported? It's a standard x86 machine that can run this same code fine if I manually create an env and train the model without using ClearML.

  
  
Posted one year ago

Also tried updating the machine to CUDA 11.6, since PyTorch has prebuilt wheels for that version, and I'm still getting the same error. Can any developer weigh in on what's going on behind the scenes? Why is ClearML unable to find wheels that do exist?

  
  
Posted one year ago

Yup, absolutely. Otherwise it cannot run your code haha

  
  
Posted one year ago

I think I know why though.

ClearML tries to install the package using pip, and pip cannot find it because it's not on PyPI; it's only listed on the PyTorch download page.
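Worth spelling out why it's not on PyPI: the "+cu113" suffix is a PEP 440 local version label, and PyPI does not serve distributions carrying one, so plain resolution against PyPI alone can never find this build. A minimal illustration of how the label splits off:

```python
# PEP 440: everything after "+" is the local version label. PyPI rejects
# uploads that carry one, so builds like 1.12.1+cu113 only live on
# third-party indexes such as the PyTorch download page.
version = "1.12.1+cu113"
public, _, local = version.partition("+")
print(public)  # 1.12.1
print(local)   # cu113
```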

  
  
Posted one year ago

It's not because of the remote machine, it's the requirements 😅 As I said, the package is not on PyPI. Try adding this at the top of your requirements.txt:

-f torch==1.12.1+cu113
...other deps...

  
  
Posted one year ago

It is likely you have mismatched CUDA versions. I presume you have cu113 locally but cu114 remotely. Have you been running any updates lately?

  
  
Posted one year ago

I believe ClearML has a different method of detecting installed packages. Despite adding that to my requirements.txt, the error persists. Also of note: under the Execution tab of the task, the list of installed packages is as follows (it matches my pip environment rather than what's in my requirements.txt file):
clearml == 1.6.4
numpy == 1.23.1
pytorch_lightning == 1.7.0
tensorboard == 2.9.1
torch == 1.12.1+cu113
tqdm == 4.64.0
transformers == 4.21.1

  
  
Posted one year ago

Torch does have a build for cu113, as can be seen here: https://download.pytorch.org/whl/torch_stable.html, which is what I have installed and working on my local machine. I think the question is: why can't the remote machine also find and install it?

  
  
Posted one year ago
783 Views