Hi! I Have Some Agents On Gcp. Lately I Have Been Getting Some Experiments That Simply Stop Running (No Signs That The Experiment Crashed). Here Is A Plot That Shows The Resource Monitoring. Any Ideas On What Could Be Causing This?

Answered

Hi! I have some agents on GCP. Lately I have been getting some experiments that simply stop running (no signs that the experiment crashed). Here is a plot that shows the resource monitoring. Any ideas on what could be causing this?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					GrievingTurkey78
				
					0
					 × 1

Votes Newest

Answers 6

GrievingTurkey78 , what framework are you working with? Can you provide some more information regarding your environment - linux/windows, pip/conda? Can you provide maybe a snippet of your code I can try to run to reproduce?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					CostlyOstrich36
				
					0

I am using pytorch_lightning , I'll try to create a snippet I can share! Thanks 🙌

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					GrievingTurkey78
				
					0
					 × 1

Yeah, I experienced the same issue. Training stopps / freezes at the end of the 10th, or 15th epoch. Using pytorch_lightning as well.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					WittyOwl57
				
					0
					 × 1

GrievingTurkey78 Hi!

What versions of clearml and clearml-agent are you using? Also for how long were the experiments were going?

Seems like agent is still reporting iterations and usage for the experiment so what do you mean by stopped?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					CostlyOstrich36
				
					0

You can check the run time by switching to 'wall time' axis 🙂

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					CostlyOstrich36
				
					0

Hey CostlyOstrich36 ! I am using clearml==1.1.2 and clearml-agent==1.1.0 . Stopped is not the right word, more like frozen, it just froze at an epoch. The console on the agent shows epoch 33 first batch and the one at the server epoch 32 last batch. The experiment was running for ~6 hours.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					GrievingTurkey78
				
					0
					 × 1

Write your answer

2K Views

6 Answers

4 years ago

2 years ago