Answered

Hi, when trying to use a remote agent to train a model, the initial environment setup on the remote machine fails because the list of requirements located in /tmp/cached-reqsaw90argk.txt contains a link to an aarch64 wheel:
```
Flask==2.0.2
more_itertools==8.12.0
nltk==3.6.7
numpy==1.21.3
pytorch_lightning==1.5.10
scikit_learn==1.0.1
tensorboard==2.7.0
torchmetrics==0.7.2
tqdm==4.62.3
transformers==4.12.2
clearml==1.3.0
```
I'm trying to understand how that line is being generated. My local machine is x86, as is the remote one, and my local install of pytorch (x86) is working just fine. Any idea why ClearML thinks I have an aarch64 package installed on my x86 machine? Thanks

  
  
Posted 2 years ago

7 Answers


It may be worth noting the command that was used to install pytorch on my local machine:
```
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```

When navigating to that link, the aarch64 wheel appears before the x86 wheel in the list. Might be a long shot, but is it possible that during the pip requirements generation phase, ClearML is visiting this link, looking for the first matching version, and using that without properly checking the architecture?
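To make that concrete, the kind of compatibility check I'd expect during wheel selection looks roughly like this (purely an illustrative sketch built on the packaging library, not ClearML's actual resolver; the filenames are made-up examples):

```
# Illustrative sketch only (not ClearML's actual resolver): given wheel filenames
# scraped from an index page such as torch_stable.html, keep only the ones whose
# interpreter/ABI/platform tags are supported on the current machine.
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

def compatible_wheels(wheel_filenames):
    supported = set(sys_tags())  # tags valid for *this* interpreter and architecture
    for filename in wheel_filenames:
        _name, _version, _build, tags = parse_wheel_filename(filename)
        if tags & supported:  # at least one tag matches, e.g. cp38-cp38-linux_x86_64
            yield filename

# Example: the aarch64 wheel may appear first in the index, but on an x86_64
# machine it should be skipped rather than picked just for being first.
candidates = [
    "torch-1.11.0-cp38-cp38-manylinux2014_aarch64.whl",
    "torch-1.11.0+cu113-cp38-cp38-linux_x86_64.whl",
]
print(list(compatible_wheels(candidates)))
```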

  
  
Posted 2 years ago

Thanks for the details TroubledJellyfish71!
The agent should have automatically resolved the line torch == 1.11.0+cu113 into the correct torch build (based on the CUDA version installed on the agent's machine, or the CPU version if no CUDA is installed).
Can you send the Task log (console output) from the run the agent executed and failed on?
(You can DM it to me so it isn't public.)
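For reference, the resolution is conceptually along these lines (a simplified sketch, not the agent's actual implementation; the helper names are mine):

```
# Simplified sketch (not the agent's actual code) of turning a requirement like
# "torch == 1.11.0+cu113" into a build that matches the machine the agent runs on.
import re
import shutil
import subprocess

def detect_cuda_tag():
    """Return e.g. 'cu113' if a CUDA driver/runtime is visible, else 'cpu'."""
    if shutil.which("nvidia-smi") is None:
        return "cpu"
    out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", out)
    return f"cu{match.group(1)}{match.group(2)}" if match else "cpu"

def resolve_torch(requested="1.11.0"):
    tag = detect_cuda_tag()
    # The +cuXXX local-version suffix selects the matching build from the PyTorch
    # download index; with no CUDA available, the plain (CPU) build is used instead.
    return f"torch=={requested}+{tag}" if tag != "cpu" else f"torch=={requested}"

print(resolve_torch())  # e.g. torch==1.11.0+cu113 on a CUDA 11.3 machine
```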

  
  
Posted 2 years ago

Hi TroubledJellyfish71
What do you have listed in the original Task's execution section under "installed packages"?
How did it end up with an http link for pytorch? Usually it would be torch==1.11 ...
EDIT:
I'm assuming the original Task was executed on a Mac M1. What do you get when calling pip freeze?
And where is the agent running? (And is it venv or docker mode?)
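A quick way to rule out an architecture mix-up is to check what Python itself reports on both machines:

```
# Run on both the local machine and the agent's machine; an x86_64 Linux box
# should print something like "x86_64 Linux", an M1 Mac "arm64 Darwin".
import platform
print(platform.machine(), platform.system())
```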

  
  
Posted 2 years ago

Thanks, just sent it.

  
  
Posted 2 years ago

The installed packages section for the task contains the following:
```
# Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
Flask == 2.0.2
clearml == 1.3.0
more_itertools == 8.12.0
nltk == 3.6.7
numpy == 1.21.3
pytorch_lightning == 1.5.10
scikit_learn == 1.0.1
tensorboard == 2.7.0
torch == 1.11.0+cu113
torchmetrics == 0.7.2
tqdm == 4.62.3
transformers == 4.12.2
```
The only thing that looks different is the torch line, which here is a version pin rather than the URL that ends up in the cached requirements file, so somehow that URL is being generated within ClearML from this list?

The original task is being executed from an Ubuntu installation running on an x86 Ryzen 7 processor; no Mac M1 or other ARM processor is or has ever been involved with this, which is why I find this so odd.

pip freeze returns a very long list, as seen here: https://pastebin.com/gGKF123S
So it seems ClearML is selectively choosing only the required subset of the full environment to install on the remote instance.
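(That selective list is consistent with requirements being detected from the imports the script actually uses rather than taken from a full pip freeze; conceptually something like the sketch below, which is my own simplification and not ClearML's actual detection code.)

```
# Simplified sketch (not ClearML's actual detection code): build a minimal
# requirements list from the top-level imports of a script, instead of dumping
# the entire environment with `pip freeze`.
import ast
import sys
# packages_distributions() needs Python 3.10+; on 3.8 use the importlib_metadata backport.
from importlib.metadata import packages_distributions, version

def detect_requirements(script_path):
    tree = ast.parse(open(script_path).read())
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])

    dist_map = packages_distributions()  # maps import name -> distribution name(s)
    found = {dist: version(dist) for mod in modules for dist in dist_map.get(mod, [])}
    return sorted(f"{name} == {ver}" for name, ver in found.items())

if __name__ == "__main__":
    print("\n".join(detect_requirements(sys.argv[1])))
```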

Finally, the agent is running on Ubuntu on an Intel Xeon server, in venv mode.

Thanks for your help. I hope this isn't some stupid mistake on my part, but I've been racking my brain for a couple of hours now.

  
  
Posted 2 years ago

Thanks for the fast response; I'll be keeping an eye out for the update. This makes sense, as I had to update to 1.11 for a feature and wasn't encountering the issue with 1.10 previously.

  
  
Posted 2 years ago

Thanks TroubledJellyfish71, I managed to locate the bug (and indeed it's in the new aarch64 package support).
I'll make sure we push an RC in the next few days. Until then, as a workaround, you can put the full link (http) to the torch wheel.
BTW: 1.11 is the first torch version to support aarch64, so if you request a lower torch version you will not encounter the bug.
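For example, the workaround would mean replacing the torch entry in the Task's "installed packages" with a direct wheel link, along the lines below. The exact filename is an assumed example; pick the wheel matching your Python version, CUDA build, and platform from the PyTorch download index.

```
# before
torch == 1.11.0+cu113
# after (direct link to the matching x86_64 wheel, assumed cp38 / linux_x86_64 / cu113)
https://download.pytorch.org/whl/cu113/torch-1.11.0%2Bcu113-cp38-cp38-linux_x86_64.whl
```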

  
  
Posted 2 years ago