Hi All, I'Ve Successfully Run A Task Locally, And Now I'M Trying To Clone It And Send It To A Queue. It Looks Like The Environment Is Built Successfully, But It Hangs Here:

Unanswered

Can this be reproducible using a simple script that we can also run?

Not really unfortunately - happy to share my code, but I've managed to reproduce this with different codebases.

As a summary of what I've tried:

Agent on the H100 machine, Server on Kube - Fail
Agent on laptop, Server on Kube - Fail
Agent on laptop, Server on Docker Desktop - Pass
So I'm 100% sure there is something wrong with our ClearML Server deployment on Kube rather than an issue with the agents or code. As for which of the 7 containers could be at fault... :man-shrugging: . I'm not seeing anything out of the ordinary in the logs. Is there a verbose setting in the agent that could help us diagnose, i.e. each step of what goes on in Task.init ?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					DepravedBee82
				
					0
					 × 1

311 Views

0 Answers

one year ago