Hi, when trying to use a remote agent to train a model, the initial environment setup on the remote machine fails because the list of requirements located in /tmp/cached-reqsaw90argk.txt contains a link to an aarch64 wheel

Answered

Hi, when trying to use a remote agent to train a model, the initial environment setup on the remote machine fails because the list of requirements located in /tmp/cached-reqsaw90argk.txt contains a link to an aarch64 wheel:
` Flask==2.0.2

more_itertools==8.12.0
nltk==3.6.7
numpy==1.21.3
pytorch_lightning==1.5.10
scikit_learn==1.0.1
tensorboard==2.7.0

torchmetrics==0.7.2
tqdm==4.62.3
transformers==4.12.2
clearml==1.3.0 `
I'm trying to understand how that line is being generated. My local machine is x86, as is the remote one, and my local install of pytorch (x86) is working just fine. Any idea why ClearML thinks I have an aarch64 package installed on my x86 machine? Thanks

  				
Posted 3 years ago by TroubledJellyfish71


Answers 7

Thanks, just sent it.

  				
Posted 3 years ago by TroubledJellyfish71

Hi TroubledJellyfish71
What do you have listed in the Task's execution "installed packages" section (of the original Task)?
How did it end up with an http link for pytorch?
Usually it would be torch==1.11 ...
EDIT:
I'm assuming the original Task was executed on a Mac M1. What do you get when calling pip freeze?
And where is the agent running (and is it in venv or docker mode)?

  				
Posted 3 years ago by AgitatedDove14

Thanks TroubledJellyfish71, I managed to locate the bug (and indeed it's the new aarch64 package support).
I'll make sure we push an RC in the next few days. Until then, as a workaround, you can put the full link (http) to the torch wheel.
BTW: 1.11 is the first version to support aarch64, so if you request a lower torch version, you will not encounter the bug.
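
For anyone applying the workaround, here is a minimal sketch of how the full wheel link could be assembled. The helper name `torch_wheel_url` is hypothetical, and the URL layout (`<base>/<cuda tag>/<wheel filename>`, with the `+` percent-encoded and the ABI tag equal to the Python tag) is an assumption based on how the PyTorch download index mentioned in this thread appears to be organised. Verify that the resulting link actually exists before pasting it into the requirements.

```python
from urllib.parse import quote


def torch_wheel_url(version, cuda_tag, python_tag, platform_tag,
                    base="https://download.pytorch.org/whl"):
    """Build a direct link to a torch wheel (hypothetical helper).

    Assumes the index layout <base>/<cuda_tag>/<filename> and that the
    ABI tag equals the Python tag (true for cp38, not for every version).
    """
    filename = (
        f"torch-{version}+{cuda_tag}-{python_tag}-{python_tag}-{platform_tag}.whl"
    )
    # quote() percent-encodes the '+' of the local version as %2B
    return f"{base}/{cuda_tag}/{quote(filename)}"


# Values taken from this thread: torch 1.11.0+cu113 on Python 3.8, x86_64 Linux
url = torch_wheel_url("1.11.0", "cu113", "cp38", "linux_x86_64")
print(url)
```

The printed link names the x86_64 wheel explicitly, so no architecture lookup is needed on the agent side.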

  				
Posted 3 years ago by AgitatedDove14

The installed packages section for the task contains the following:
` # Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]

Flask == 2.0.2
clearml == 1.3.0
more_itertools == 8.12.0
nltk == 3.6.7
numpy == 1.21.3
pytorch_lightning == 1.5.10
scikit_learn == 1.0.1
tensorboard == 2.7.0
torch == 1.11.0+cu113
torchmetrics == 0.7.2
tqdm == 4.62.3
transformers == 4.12.2 `
The only thing that looks different is that the torch line here is a plain version specifier rather than a URL, so somehow that URL is being generated within ClearML from this list?

The original task is being executed from an Ubuntu installation running on an x86 Ryzen 7 processor. No Mac M1 or other ARM processor is or has ever been involved with this, which is why I find this so odd.

Pip freeze returns a very long list, as seen here: https://pastebin.com/gGKF123S
So it seems ClearML is selectively choosing only the required subset of the full env to install on the remote instance.

Finally, the agent is running in Ubuntu on an Intel Xeon server, in venv mode.

Thanks for your help. I hope this isn't some stupid mistake on my part, but I've been racking my brain for a couple of hours now.

  				
Posted 3 years ago by TroubledJellyfish71

It may be worth noting the command that was used to install pytorch on my local machine:
` pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html `

When navigating to that link, the aarch64 wheel appears before the x86 wheel in the list. Might be a long shot, but is it possible that during the pip requirements generation phase, ClearML is visiting this link, looking for the first matching version, and using that without properly checking the architecture?
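
To illustrate the suspicion above, here is a small sketch contrasting a "first version match" lookup with one that also checks the architecture. This is not ClearML's actual code. The function names and the two wheel filenames are made up for illustration, modelled on the standard wheel naming scheme `name-version-python-abi-platform.whl`.

```python
import platform


def split_wheel_name(filename):
    """Split 'name-version-python-abi-platform.whl' into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    return filename[: -len(".whl")].split("-")


def base_version(version):
    """Drop a local version suffix such as '+cu113'."""
    return version.split("+")[0]


def first_version_match(filenames, version):
    """Naive lookup: return the first wheel with a matching base version."""
    for name in filenames:
        if base_version(split_wheel_name(name)[1]) == base_version(version):
            return name
    return None


def arch_aware_match(filenames, version, machine=None):
    """Same lookup, but also require the platform tag to end with the CPU type."""
    # platform.machine() gives e.g. 'x86_64' or 'aarch64' on Linux
    machine = machine or platform.machine()
    for name in filenames:
        fields = split_wheel_name(name)
        if base_version(fields[1]) == base_version(version) and fields[-1].endswith(machine):
            return name
    return None


# Illustrative listing in which the aarch64 wheel comes first
WHEELS = [
    "torch-1.11.0-cp38-cp38-manylinux2014_aarch64.whl",
    "torch-1.11.0+cu113-cp38-cp38-linux_x86_64.whl",
]

print(first_version_match(WHEELS, "1.11.0+cu113"))
print(arch_aware_match(WHEELS, "1.11.0+cu113", machine="x86_64"))
```

With the aarch64 entry listed first, the naive lookup returns it even on an x86 machine, while the architecture-aware lookup skips it.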

  				
Posted 3 years ago by TroubledJellyfish71

Thanks for the fast response, I'll be keeping an eye out for the update. This makes sense as I had to update to 1.11 for a feature, and wasn't encountering the issue with 1.10 previously.

  				
Posted 3 years ago by TroubledJellyfish71

Thanks for the details TroubledJellyfish71 !
So the agent should have automatically resolved this line:
` torch == 1.11.0+cu113 `
into the correct torch version (based on the CUDA version installed, or the CPU version if no CUDA is installed).
Can you send the Task log (console) as executed by the agent (and failed)?
(you can DM it to me, so it's not public)
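
As a rough illustration of the resolution described above (not the agent's actual implementation), the sketch below rewrites a torch requirement for the CUDA version found on the target machine. The function name `resolve_torch_requirement` and the `cuXYZ` / `cpu` suffix convention are assumptions, taken from the version strings that appear in this thread.

```python
def resolve_torch_requirement(requirement, cuda_version=None):
    """Rewrite 'torch == X+cuNNN' for the target machine (hypothetical helper).

    cuda_version is a string such as '11.3', or None when no CUDA is installed.
    """
    name, _, version = (part.strip() for part in requirement.partition("=="))
    base = version.split("+")[0]  # drop the local suffix recorded on the dev machine
    if cuda_version:
        local = "cu" + cuda_version.replace(".", "")  # '11.3' -> 'cu113'
    else:
        local = "cpu"
    return f"{name}=={base}+{local}"


print(resolve_torch_requirement("torch == 1.11.0+cu113", cuda_version="11.3"))
print(resolve_torch_requirement("torch == 1.11.0+cu113", cuda_version=None))
```

The point is that the suffix recorded on the development machine is only a hint, and the final build should be chosen from what the agent's machine reports.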

  				
Posted 3 years ago by AgitatedDove14
