Also tried updating the machine to CUDA 11.6, since PyTorch has prebuilt wheels for that version, and I'm still getting the same error. Is any developer able to weigh in on what's going on behind the scenes? Why is ClearML unable to find wheels that do exist?
Thanks for the fast response, I'll be keeping an eye out for the update. This makes sense, as I had to update to 1.11 for a feature and wasn't encountering the issue with 1.10 previously.
I believe ClearML has a different method of detecting installed packages. Despite adding that to my requirements.txt, the error persists. Also of note, under the Execution tab of the task, the list of installed packages is as follows (it matches my pip environment rather than what's in my requirements.txt file):
` clearml == 1.6.4
numpy == 1.23.1
pytorch_lightning == 1.7.0
tensorboard == 2.9.1
torch == 1.12.1+cu113
tqdm == 4.64.0
transformers == 4.21.1 `
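As a side note, one way to point ClearML at an explicit requirements file rather than the auto-detected environment is the SDK's Task.add_requirements call. A minimal sketch, assuming the standard clearml SDK and placeholder project/task names:
` # Sketch: pin the task's requirements to an explicit file instead of the
# auto-detected environment. Must run before Task.init(); names are placeholders.
from clearml import Task

Task.add_requirements("requirements.txt")  # path to your own requirements file
task = Task.init(project_name="my_project", task_name="train") `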
I do keep both my local and remote instances updated; at this time they're both actually running CUDA 11.4 according to nvidia-smi, both with the exact same driver version (470.141.03). So it's not strictly a mismatch error, since both systems are identical. As for why I have torch cu113 installed locally, I believe torch for cu114 wasn't available when I checked. But since it works fine on my local machine, shouldn't it work on the remote machine too?
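For reference, a quick way to compare what the two machines actually report (a sketch; assumes torch is importable on both):
` # Sketch: run on both the local and the remote machine to compare the torch
# build and whether the installed driver can actually use it.
import torch

print("torch version:", torch.__version__)        # e.g. 1.12.1+cu113
print("built against CUDA:", torch.version.cuda)  # e.g. 11.3
print("cuda available:", torch.cuda.is_available()) `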
With more experimenting, this is looking like a bug. I upgraded clearml-agent to 1.4.0rc and now it finds the wheel and downloads it, but then fails with the same error as above, saying the .whl file "is not a supported wheel on this platform". But why would this wheel not be supported? It's a standard x86 machine that can run this same code fine if I manually create an env and train the model without using ClearML.
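For what it's worth, one way to see which wheel tags pip will accept on a machine is `pip debug --verbose`, or a small script like the following (a sketch; assumes the packaging library is installed):
` # Sketch: print the wheel tags this interpreter/platform accepts, to check
# whether the torch cu113 wheel's tag (e.g. cp38-cp38-linux_x86_64) is in the list.
from packaging import tags

for t in tags.sys_tags():
    print(t) `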
Buried in the documentation, I missed that completely. Thanks!
Torch does have a build for cu113, as can be seen here: https://download.pytorch.org/whl/torch_stable.html which is what I have installed and working on my local machine. I think the question is, why can the remote machine not also find and install this?
The installed packages section for the task contains the following:
` # Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
Flask == 2.0.2
clearml == 1.3.0
more_itertools == 8.12.0
nltk == 3.6.7
numpy == 1.21.3
pytorch_lightning == 1.5.10
scikit_learn == 1.0.1
tensorboard == 2.7.0
torch == 1.11.0+cu113
torchmetrics == 0.7.2
tqdm == 4.62.3
transformers == 4.12.2 `
The only thing that looks different is that the torch line has changed from a URL, so somehow that URL is being generated with...
For comparison, the contents of my requirements.txt file are:
` -f
torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1 `
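For illustration, a requirements.txt pinning the cu113 build and pointing pip at the extra wheel index might look like the sketch below; the index URL after -f is an assumption, copied from the pip install command quoted later in the thread:
` # Illustrative only: the URL after -f is assumed, matching the install
# command shown further down.
-f https://download.pytorch.org/whl/cu113/torch_stable.html
torch==1.12.1+cu113
pytorch-lightning==1.7.0
transformers==4.21.1 `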
Also, in the log file, it does say:
` Torch CUDA 113 download page found
Warning, could not locate PyTorch torch==1.12.1 matching CUDA version 113, best candidate None `
which indicates that it has found the page, it just can't find the right wheel. But what's even more odd is that when I try to initiate a task from another dev machine with no GPU (torch==1.12.1), I get the following error, indicating that it found a wheel but couldn't install it:
` Torch CUDA 113 download page found
Found Py...
This turned out to be a couple of issues, one with pip and one with ClearML. After upgrading to 1.4.0rc, ClearML was able to find and download the correct wheel, but pip failed to install it, claiming it wasn't supported on this platform. I found that by going into the clearml.conf file and removing the default configuration that constrains pip_version: "<20.2", the latest version of pip gets installed and doesn't throw that error. So I guess the takeaway is that there's a questionable d...
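For anyone hitting the same thing, the setting lives under the agent's package_manager section of clearml.conf. A sketch of the change (section and key names per the default clearml-agent config; exact defaults may differ between versions):
` # clearml.conf on the agent machine (sketch)
agent {
    package_manager {
        type: pip,
        # commenting out / removing the constraint lets the agent install a
        # current pip, which recognises the newer manylinux wheel tags
        # pip_version: "<20.2",
    }
} `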
Oh, does clearml automatically use the requirements.txt file when available?
It may be worth noting the command that was used to install PyTorch on my local machine:
` pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html `
When navigating to that link, the aarch64 wheel appears before the x86 wheel in the list. Might be a long shot, but is it possible that during the pip requirements generation phase, ClearML is visiting this link, looking for the first matching version, and ...
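As a rough way to test that hypothesis (a sketch, not ClearML's actual logic; it just fetches the index page with requests and lists the matching wheel filenames):
` # Sketch: list the torch 1.12.1+cu113 wheels on the index page and print their
# filenames; the platform tag is the last component of each filename.
import re
import requests

INDEX = "https://download.pytorch.org/whl/cu113/torch_stable.html"
html = requests.get(INDEX, timeout=30).text
for name in re.findall(r'href="([^"]*torch-1\.12\.1[^"]*cu113[^"]*\.whl)"', html):
    print(name) `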
Perfect, I'll just reinstall then; I don't have any important models to lose yet. Thanks!