Hi everyone, I'm running into a weird error when trying to clone and run a task that has completed successfully. I have a test task that loads a dummy dataset and trains a toy model with PyTorch. When running remotely, I use my own Docker image that has

Answered

Hi everyone, I'm running into a weird error when trying to clone and run a task that has completed successfully. I have a test task that loads a dummy dataset and trains a toy model with PyTorch. When running remotely, I use my own Docker image that has all of the Python packages already installed in a virtual environment. I set this environment variable for the task: `-e CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/opt/virtualenvs/clearml-venv/bin/python` and also include my repo as a Python package to install with `task.set_packages(["-e ."])`. I am able to use my local VSCode session to run the task remotely on an agent with `task.execute_remotely()`. Everything runs just fine and completes. Then when I try to clone the task and run it again with the agent, I get this error during the requirements parsing:

clearml_agent: Warning: could not resolve python wheel replacement for torch==2.0.1
clearml_agent: ERROR: Could not install task requirements!
invalid python version '' defined in configuration file, key 'agent.default_python': must have both major and minor parts of the version (for example: '3.7')
2024-04-10 11:44:31
Process failed, exit code 1

The weird thing is that the log also shows that the `agent.default_python` value is set: `agent.default_python = 3.10`. Maybe that value needs to be a string instead? But I don't understand why I run into this problem only when cloning. Shouldn't the cloned version run exactly the same as the initial remote run? Thoughts are appreciated!

  				
Posted one year ago by NuttyLobster9


Answers 8

Okay, so I discovered that setting `-e CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none` solves the issue.

That said, if someone could explain to me why this error was occurring and why it only happens in the case of cloning, I'd love to understand. Thanks!
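For anyone who wants to assemble these flags in code rather than type them by hand, here is a minimal sketch. The helper name `build_docker_env_args` is made up for illustration and is not part of ClearML; it only renders a dict of environment variables as a `-e KEY=VALUE` string, which you would then supply wherever you normally set the task's docker arguments.

```python
import shlex


def build_docker_env_args(env):
    """Render a dict of environment variables as `-e KEY=VALUE` docker arguments.

    Hypothetical helper for illustration only; not a ClearML API.
    """
    parts = []
    for key, value in env.items():
        parts.extend(["-e", f"{key}={value}"])
    # shlex.join quotes any value that contains shell-special characters
    return shlex.join(parts)


docker_args = build_docker_env_args({
    # Point the agent at the interpreter inside the pre-built virtual environment
    "CLEARML_AGENT_SKIP_PIP_VENV_INSTALL": "/opt/virtualenvs/clearml-venv/bin/python",
    # The workaround from this thread: skip PyTorch wheel resolution
    "CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE": "none",
})
print(docker_args)
```

Both variables and their values are taken from this thread; everything else is just string formatting.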

  				
Posted one year ago by NuttyLobster9

Sure thing. In any case, we will fix this bug so that the next version needs no workaround (the workaround will still work, so you won't need to change anything).

  				
Posted one year ago by AgitatedDove14

@NuttyLobster9 it's a bit hard to say, and the full log would be very helpful. Can you perhaps remove all secrets and send it in a DM so it will not be public in the channel? I assume local paths etc. are less sensitive in a DM.

  				
Posted one year ago by SuccessfulKoala55

Hi @NuttyLobster9
First, nice workaround!
Second, could you send the full log? When the venv is skipped, PyTorch resolving should be skipped as well, and no error should be raised...
And lastly, could you also send the log of the task that executed correctly (the one you cloned)? You are correct, it should have been the same.

  				
Posted one year ago by AgitatedDove14

Unfortunately, it's turning out to be quite time consuming to manually remove all of the private info in here. Is there a particular section of the log that would be useful to see? I can try to focus on just sharing that part.

  				
Posted one year ago by NuttyLobster9

@NuttyLobster9 I think we found the issue: when you pass a direct link to the Python venv, the agent fails to detect the Python version, and since the Python version is required for fetching the correct torch, it fails to install it. This is why passing `CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none` works: it skips resolving the torch / CUDA version (which requires parsing the Python version).
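To illustrate the failure mode described above, here is a rough sketch. It is not the agent's actual code; both function names are hypothetical. It shows one way a `major.minor` version could be read from an interpreter path, and why an empty detection result would trip a check like the one in the error message.

```python
import re
import subprocess
import sys


def detect_python_version(python_path):
    """Ask the interpreter at `python_path` for its 'major.minor' version.

    Hypothetical helper; the agent's real detection logic may differ.
    """
    result = subprocess.run(
        [python_path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def require_major_minor(version):
    """Reject any version string that is not of the form 'major.minor'."""
    if not re.fullmatch(r"\d+\.\d+", version):
        raise ValueError(
            f"invalid python version {version!r}: "
            "must have both major and minor parts (for example: '3.7')"
        )
    return version


# Detection against the current interpreter yields a usable version string
detected = require_major_minor(detect_python_version(sys.executable))
print(detected)

# If detection yields an empty string, the validation step fails
try:
    require_major_minor("")
except ValueError as err:
    print(err)
```

With PyTorch resolving disabled, nothing needs the parsed version, so a check of this kind would never be reached.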

  				
Posted one year ago by AgitatedDove14

Hi Martin, I see. That makes sense, though I would have expected the behavior to be the same when running remotely the first time as well. In any case, this solved the issue for me. Thanks for looking at it.

  				
Posted one year ago by NuttyLobster9

Hi @AgitatedDove14, sure. I just need to scrape them for any sensitive info, then I'll post to this thread. Thanks for your reply.

  				
Posted one year ago by NuttyLobster9
