Answered

Hi, I'm having an issue getting a clearml-agent machine with a RTX 3090 to train remotely because it can't install PyTorch. My local development environment (also with a 3090) has torch==1.12.1+cu113, which I installed with the command: pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html . When I execute my task remotely, it fails with the error "Exception when trying to resolve python wheel: Could not find pytorch wheel URL for: torch==1.12.1 with cuda 114 support". The system seems kind of opaque; I can't readily tell what exactly is going on behind this error. Any ideas what's going wrong here? I've attached the full log too. Thanks!
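
As a possible workaround, the CUDA version the agent resolves wheels against can be pinned on the agent machine. This is a minimal sketch, assuming the agent honors an agent.cuda_version key in clearml.conf; verify the key name and value format against your installed clearml-agent version:

agent {
    # clearml.conf on the agent machine: override the auto-detected CUDA (114 here)
    # so the PyTorch wheel lookup targets cu113 instead
    cuda_version: "11.3"
}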

  
  
Posted one year ago

Answers 15


It is likely you have mismatched CUDA versions. I presume you have cu113 locally but cu114 remotely. Have you run any updates lately?

  
  
Posted one year ago

Torch does have a build for cu113, as can be seen here: https://download.pytorch.org/whl/torch_stable.html , which is what I have installed and working on my local machine. I think the question is: why can't the remote machine also find and install it?

  
  
Posted one year ago

Also, in the log file, it does say:
Torch CUDA 113 download page found
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113, best candidate None
which indicates that it has found the page, it just can't find the right wheel. What's even more odd is that when I initiate a task from another dev machine with no GPU (torch==1.12.1), I get the following error, indicating that the agent found a wheel but couldn't install it:
Torch CUDA 113 download page found
Found PyTorch version torch==1.12.1 matching CUDA version 113
Torch CUDA 113 download page found
Found PyTorch version torchvision==0.13.1 matching CUDA version 113
Collecting torch==1.12.1+cu113
  Downloading (1837.7 MB) 1.8/1.8 GB 2.4 MB/s eta 0:00:00
  Saved ./.clearml/pip-download-cache/cu114/torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.13.1+cu113
  Downloading (23.4 MB) 23.4/23.4 MB 51.6 MB/s eta 0:00:00
  Saved ./.clearml/pip-download-cache/cu114/torchvision-0.13.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
clearml_agent: ERROR: Could not install task requirements!
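
A quick way to see why pip rejects a downloaded wheel (a sketch; pip debug exists from pip 19.2 onward and its output is explicitly marked unstable) is to compare the wheel's filename tags against the tags that machine's pip actually supports:

# run on the machine where the install fails
python -m pip debug --verbose | grep -i cp310
# torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl will only install if a matching
# cp310 / linux_x86_64 (or compatible manylinux) tag shows up in that list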

  
  
Posted one year ago

I believe ClearML has a different method of detecting installed packages. Despite adding that to my requirements.txt, the error persists. Also of note: under the Execution tab of the task, the list of installed packages is as follows (it matches my pip environment rather than what's in my requirements.txt file):
clearml == 1.6.4
numpy == 1.23.1
pytorch_lightning == 1.7.0
tensorboard == 2.9.1
torch == 1.12.1+cu113
tqdm == 4.64.0
transformers == 4.21.1
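
If the intent is to have the task record requirements.txt instead of the auto-detected pip environment, one option is the SDK call below. This is a sketch assuming clearml's Task.force_requirements_env_freeze uses the given file as the task's requirements; check the behavior against your clearml version:

from clearml import Task

# call before Task.init so the recorded "installed packages" come from the file
# rather than from the current pip environment
Task.force_requirements_env_freeze(force=True, requirements_file="requirements.txt")

task = Task.init(project_name="my_project", task_name="train")  # hypothetical names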

  
  
Posted one year ago

For comparison, the contents of my requirements.txt file are:
-f torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1

  
  
Posted one year ago

When you installed locally, you told pip to look for packages on that page; you don't tell the remote pip to do the same.
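
To make that concrete (using the command quoted in the question), the local install points pip at the PyTorch wheel index, while the remote side only ever sees the bare requirement:

# local install (works): pip is told where to find the +cu113 build
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

# what requirements.txt alone asks the remote pip to satisfy; this build is not on PyPI
torch==1.12.1+cu113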

  
  
Posted one year ago

Oh, does clearml automatically use the requirements.txt file when available?

  
  
Posted one year ago

Yup, absolutely. Otherwise it cannot run your code haha

  
  
Posted one year ago

I also tried updating the machine to CUDA 11.6, since PyTorch has prebuilt wheels for that version, and I'm still getting the same error. Is any developer able to weigh in on what's going on behind the scenes? Why is ClearML unable to find wheels that do exist?

  
  
Posted one year ago

I do keep both my local and remote instances updated; at this time they're both actually running CUDA 11.4 according to nvidia-smi, with the exact same driver version (470.141.03). So it's not strictly a mismatch error, since both systems are identical. As for why I have torch cu113 installed locally, I believe torch for cu114 wasn't available when I checked. But since it works fine on my local machine, it should work on the remote machine too, right?
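
Worth noting when comparing the two machines: nvidia-smi reports the highest CUDA version the installed driver supports, not the CUDA build a given torch wheel was compiled against. A quick check on each machine:

nvidia-smi --query-gpu=driver_version --format=csv,noheader    # driver, e.g. 470.141.03
nvidia-smi | grep "CUDA Version"                               # CUDA supported by the driver (11.4 here)
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA build the installed wheel uses (11.3)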

  
  
Posted one year ago

The log suggests it can't find a cu113 build either:

Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113

  
  
Posted one year ago

This turned out to be a couple of issues, one with pip and one with ClearML. After upgrading to 1.4.0rc, ClearML was able to find and download the correct wheel, but pip failed to install it, claiming it wasn't supported on this platform. I found that by going into the clearml.conf file and removing the default setting that constrains pip_version: "<20.2", the latest version of pip gets installed and no longer throws that error. So I guess the takeaway is that there's a questionable default in clearml.conf that should probably be changed to use the latest version of pip, rather than an older, buggy one.
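
For reference, the override described above would look roughly like this in clearml.conf (a sketch; agent.package_manager.pip_version is the key shipped in the default config):

agent {
    package_manager {
        # default ships as pip_version: "<20.2"; removing or relaxing the pin
        # lets the agent install a current pip that understands newer wheel tags
        # pip_version: "<20.2"
    }
}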

  
  
Posted one year ago

With more experimenting, this is looking like a bug. I upgraded clearml-agent to 1.4.0rc and now it finds the wheel and downloads it, but then fails with the same error as above, saying the .whl file "is not a supported wheel on this platform". But why would this wheel not be supported? It's a standard x86 machine that can run this same code fine if I manually create an env and train the model without using ClearML.

  
  
Posted one year ago

I think I know why, though.

ClearML tries to install the package using pip, and pip can't find it because it's not on PyPI; it's only listed on the PyTorch download page.

  
  
Posted one year ago

It's not because of the remote machine, it's the requirements 😅 As I said, the package is not on PyPI. Try adding this at the top of your requirements.txt:

-f https://download.pytorch.org/whl/torch_stable.html
torch==1.12.1+cu113
...other deps...

  
  
Posted one year ago