StickyBlackbird93 the agent is supposed to resolve the correct version of pytorch based on the CUDA in the container. Sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker mode, or inside a docker container?
Yep, I set this env variable and it just doesn't help. ClearML still installs torch for the wrong platform (arm). The problem was only resolved after I pinned the particular pytorch wheels in the requirements in the repo.
Moreover, if I set the wheels in the UI, ClearML still installs the wrong package.
Seems like a bug.
I'm running agent inside docker.
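For reference, the workaround described above (pinning an exact x86_64 pytorch wheel in the repo's requirements file) might look like this sketch; the index URL, CUDA tag, and version are assumptions and should be matched to the CUDA version in your container:

```shell
# Hypothetical example: pin an exact CUDA 11.5 x86_64 torch build in
# requirements.txt so the agent cannot pick an aarch64 wheel
cat >> requirements.txt <<'EOF'
--extra-index-url https://download.pytorch.org/whl/cu115
torch==1.11.0+cu115
EOF
```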
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
Hi StickyBlackbird93
Yes, this agent version is rather old (clearml_agent v1.0.0); it had a bug where a pytorch aarch64 wheel broke the agent (by default the agent in docker mode will use the latest stable version, but not in venv mode)
Basically, upgrading to the latest clearml-agent version should solve the issue: pip3 install -U clearml-agent==1.2.3
BTW for future debugging, this is the interesting part of the log (notice it is looking for the correct pytorch based on the auto-detected CUDA version, 11.5).
Then it failed because it found an aarch64 wheel instead of x86_64 (this is the bug that was fixed in the latest version):
1654011488836 sjc13-t04-mlt02:!6e9:gpu0 DEBUG Torch CUDA 115 download page found
Found PyTorch version torch==1.11.0 matching CUDA version 115
ERROR: torch-1.11.0-cp39-cp39-manylinux2014_aarch64.whl is not a supported wheel on this platform.
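When debugging this kind of mismatch, a quick sanity check is to print the CPU architecture inside the container and compare it against the wheel tag the agent resolved (this is a generic check, not a clearml-agent command):

```shell
# The machine architecture must match the resolved wheel's platform tag:
# expect x86_64 here, while the failing wheel above was tagged aarch64
python3 -c "import platform; print(platform.machine())"
```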
Hi Danil,
You can use the following env variable to set it 🙂 CLEARML_AGENT_SKIP_PIP_VENV_INSTALL
AgitatedDove14 the agent doesn't even try to resolve this conflict, it directly installs the wrong version of pytorch. I'm running the agent inside docker.