It may be worth noting the command that was used to install pytorch on my local machine: pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f
https://download.pytorch.org/whl/cu113/torch_stable.html
When navigating to that link, the aarch64 wheel appears before the x86 wheel in the list. Might be a long shot, but is it possible that during the pip requirements generation phase, ClearML is visiting this link, looking for the first matching version, and using that without properly checking the architecture?
Thanks for the details TroubledJellyfish71 !
So the agent should have resolved automatically this line:torch == 1.11.0+cu113
into the correct torch version (based on the cuda version installed, or cpu version if no cuda is installed)
Can you send the Task log (console) as executed by the agent (and failed)?
(you can DM it to me, so it's not public)
Hi TroubledJellyfish71
What do you have listed on the Task's execution "installed packages" section ? (of the original Task) ?
How did it end up with an http link of pytorch ?
Usually it would be torch==1.11
...
EDIT:
I'm assuming the original Task was executed on a Mac M1, what are you getting when calling pip freeze
?
And where is the agent running ? (and is it venv or docker mode?)
The installed packages section for the task contains the following:
` # Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
Flask == 2.0.2
clearml == 1.3.0
more_itertools == 8.12.0
nltk == 3.6.7
numpy == 1.21.3
pytorch_lightning == 1.5.10
scikit_learn == 1.0.1
tensorboard == 2.7.0
torch == 1.11.0+cu113
torchmetrics == 0.7.2
tqdm == 4.62.3
transformers == 4.12.2 `
Only thing that looks different is that the torch line has changed from a URL, so somehow that URL is being generated within ClearML from this list?
The original task is being executed from an Ubuntu installation running on an x86 Ryzen 7 processor, no Mac M1 or other arm processor is or has ever been involved with this, which is why I find this so odd.
Pip freeze returns a very long list, as seen here: https://pastebin.com/gGKF123S
So it seems ClearML is selectively choosing only the required subset of the full env to install on the remote instance.
Finally, the agent is running in Ubuntu on an Intel Xeon server, in venv mode.
Thanks for your help, I hope this isn't some stupid mistake on my part, but I've been wracking my brain for a couple hours now.
Thanks for the fast response, I'll be keeping an eye out for the update. This makes sense as I had to update to 1.11 for a feature, and wasn't encountering the issue with 1.10 previously.
Thanks TroubledJellyfish71 I manged to locate the bug (and indeed it's the new aarach package support)
I'll make sure we push an RC in the next few days, until then as a workaround, you can put the full link (http) to the torch wheel
BTW: 1.11 is the first version to support aarch64, if you request a lower torch version, you will not encounter the bug