This turned out to be a couple of issues, one with pip and one with ClearML. After upgrading to 1.4.0rc, ClearML was able to find and download the correct wheel, but pip failed to install it, claiming it wasn't supported on this platform. I found that by going into the clearml.conf file and removing the default configuration that constrains pip_version: "<20.2", the latest version of pip gets installed and the error goes away. So I guess the takeaway is that there's a questionable default setting in clearml.conf that should probably be changed to use the latest version of pip by default, rather than an older, buggy version...
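For anyone hitting the same thing, the setting lives under the agent's package_manager section of clearml.conf, roughly like this (exact layout and default pin may differ between versions):

agent {
    package_manager: {
        # this default pin is what caused the "not a supported wheel" failure for me;
        # removing it (or relaxing it) lets the agent install a current pip
        # pip_version: "<20.2",
    }
}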
The log also suggests there is no cu113 installation:
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113
I believe ClearML has a different method of detecting installed packages. Despite adding that to my requirements.txt, the error persists. Also of note, under the Execution tab of the task, the list of installed packages is as follows (it matches my pip environment rather than what's in my requirements.txt file):
clearml == 1.6.4
numpy == 1.23.1
pytorch_lightning == 1.7.0
tensorboard == 2.9.1
torch == 1.12.1+cu113
tqdm == 4.64.0
transformers == 4.21.1
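From what I can tell, the SDK records whatever it detects in the local environment at Task.init time unless you explicitly point it at a file, e.g. something roughly like this (I may not have the exact API right, so treat it as a sketch; the project/task names are placeholders):

from clearml import Task

# Assumption: add_requirements accepts a path to a requirements file
# and has to be called before Task.init to take effect.
Task.add_requirements("requirements.txt")
task = Task.init(project_name="my_project", task_name="train")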
When installing locally, you told pip to look for packages on that page, but you don't tell that to the remote pip
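i.e. locally you effectively did something like

pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

while the remote pip only sees torch==1.12.1+cu113 and goes looking for it on PyPI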
I do keep both my local and remote instances updated; at this time they're both actually running CUDA 11.4 according to nvidia-smi, both with the exact same driver version (470.141.03). So it's not strictly a mismatch error, since both systems are identical. As for why I have torch cu113 installed locally, I believe torch for cu114 wasn't available when I checked. But since it works fine on my local machine, shouldn't it work on the remote machine too?
After more experimenting, this is looking like a bug. I upgraded clearml-agent to 1.4.0rc and now it finds the wheel and downloads it, but then fails with the same error as above, saying the .whl file "is not a supported wheel on this platform". But why would this wheel not be supported? It's a standard x86 machine that can run this same code fine if I manually create an env and train the model without using ClearML.
Oh, does clearml automatically use the requirements.txt file when available?
Yup, absolutely. Otherwise it cannot run your code haha
It is likely you have mismatched CUDA versions. I presume you have cu113 locally but cu114 remotely. Have you run any updates lately?
I think I know why though.
ClearML tries to install the package using pip, and pip cannot find it because it's not on PyPI; it's only listed on the PyTorch download page
It's not because of the remote machine, it's the requirements 😅 As I said, the package is not on PyPI. Try adding this at the top of your requirements.txt:
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.12.1+cu113
...other deps...
For comparison, the contents of my requirements.txt file are:
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1
I also tried updating the machine to CUDA 11.6, since PyTorch has prebuilt wheels for that version, and I'm still getting the same error. Is any developer able to weigh in on what's going on behind the scenes? Why is ClearML unable to find wheels that do exist?
Also, in the log file, it does say:
Torch CUDA 113 download page found
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113, best candidate None
which indicates that it has found the page, it just can't find the right wheel. But what's even more odd is that when I try to initiate a task from another dev machine with no GPU (torch==1.12.1), I get the following error, indicating that it found the wheel but couldn't install it:
Torch CUDA 113 download page found
Found PyTorch version torch==1.12.1 matching CUDA version 113
Torch CUDA 113 download page found
Found PyTorch version torchvision==0.13.1 matching CUDA version 113
Collecting torch==1.12.1+cu113
  Downloading (1837.7 MB) ━━━━━━━━ 1.8/1.8 GB 2.4 MB/s eta 0:00:00
Saved ./.clearml/pip-download-cache/cu114/torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
Collecting torchvision==0.13.1+cu113
  Downloading (23.4 MB) ━━━━━━━━ 23.4/23.4 MB 51.6 MB/s eta 0:00:00
Saved ./.clearml/pip-download-cache/cu114/torchvision-0.13.1+cu113-cp310-cp310-linux_x86_64.whl
Successfully downloaded torchvision
ERROR: torch-1.12.1+cu113-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
clearml_agent: ERROR: Could not install task requirements!
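One thing that might help narrow it down: running something like

python -m pip debug --verbose

with the agent's Python on that machine should print the wheel tags pip will accept there; if cp310-cp310-linux_x86_64 isn't in that list (e.g. the agent's venv is on a different Python version, or pip itself is old), that would explain the "is not a supported wheel on this platform" error.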
Torch does have a build for cu113, as can be seen here: https://download.pytorch.org/whl/torch_stable.html, which is what I have installed and working on my local machine. I think the question is: why can the remote machine not also find and install it?