Hi! Any Idea Why Clearml Fails To Detect Iteration Reporting?

Answered

Hi! Any idea why ClearML fails to detect iteration reporting?
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
I am using pytorch_lightning.

  				
Posted 3 years ago by GrievingTurkey78


Answers 21

GrievingTurkey78, do you have iterations stated explicitly somewhere in the script?

  				
Posted 3 years ago by CostlyOstrich36

GrievingTurkey78, the default is 3 minutes. You can try setting it to a long enough time to make sure it doesn't skip the epoch 🙂

  				
Posted 3 years ago by CostlyOstrich36

Sure! Could you point me to how it's done?

  				
Posted 3 years ago by GrievingTurkey78

Last question, CostlyOstrich36, sorry to poke you! It seems that even if I set an extremely long time, it still fails when the first plots are reported. The first plots are generated automatically by PyTorch Lightning and track the CPU and GPU usage. Do you think this could be the cause, or should it still detect the iteration?

  				
Posted 3 years ago by GrievingTurkey78

Yes, CostlyOstrich36.

  				
Posted 3 years ago by GrievingTurkey78

Oh, I think I am wrong! Then it must be the ClearML monitoring. Still, it fails way before the timer ends.

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, did you try calling task.set_resource_monitor_iteration_timeout after the task init?

  				
Posted 3 years ago by CostlyOstrich36

Hey CostlyOstrich36, I am doing a lot of things before the first plot is reported! Is the seconds_from_start parameter unbounded? What should I do if it takes a long time to report the first plot?

  				
Posted 3 years ago by GrievingTurkey78

I set it to 200000! But the problem seems to arise when the first plot is the ClearML CPU and GPU monitoring. Were you able to reproduce it? Even with a fairly large number, the message appeared as soon as the monitoring plot was reported.

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, please try Task.init(auto_resource_monitoring=False, ...)

  				
Posted 3 years ago by CostlyOstrich36
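
The suggestion above can be sketched as follows. This is only an illustration: the helper name task_init_kwargs and the project and task names are hypothetical, and the only ClearML-specific piece is the auto_resource_monitoring argument of Task.init mentioned in the answer.

```python
def task_init_kwargs(project_name, task_name, resource_monitoring=False):
    """Build keyword arguments for clearml.Task.init (hypothetical helper).

    With resource_monitoring=False, ClearML's automatic CPU/GPU monitor
    is turned off, as suggested in the thread.
    """
    return {
        "project_name": project_name,
        "task_name": task_name,
        "auto_resource_monitoring": resource_monitoring,
    }


# Usage, assuming clearml is installed and configured:
#   from clearml import Task
#   task = Task.init(**task_init_kwargs("examples", "lightning-run"))
```

Disabling the monitor this way also removes ClearML's machine statistics plots from the task.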

Thanks!

  				
Posted 3 years ago by GrievingTurkey78

I'll give that a try! Thanks CostlyOstrich36

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, could it be a heavy calculation that takes time? ClearML falls back to time instead of iterations once a certain timeout has passed. You can configure it with task.set_resource_monitor_iteration_timeout(seconds_from_start=<TIME_IN_SECONDS>)

  				
Posted 3 years ago by CostlyOstrich36
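
A minimal sketch of the timeout call described above. The helper name set_first_report_timeout and the minute-based argument are assumptions made for illustration; the ClearML call itself, set_resource_monitor_iteration_timeout(seconds_from_start=...), is the one named in the answer.

```python
def set_first_report_timeout(task, minutes):
    """Extend how long the resource monitor waits for the first iteration report.

    `task` is expected to be the object returned by clearml.Task.init().
    """
    seconds = int(minutes * 60)  # the ClearML call takes seconds, not minutes
    task.set_resource_monitor_iteration_timeout(seconds_from_start=seconds)
    return seconds


# Usage, assuming clearml is installed and configured:
#   from clearml import Task
#   task = Task.init(project_name="examples", task_name="lightning-run")
#   set_first_report_timeout(task, minutes=60)
```

The call is made right after Task.init so the longer timeout is in place before any setup work that delays the first report.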

Thanks 🙌

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, let me take a look into it 🙂

  				
Posted 3 years ago by CostlyOstrich36

GrievingTurkey78, what timeout did you set? Please note that it's in seconds, so it needs to be a fairly large number.

  				
Posted 3 years ago by CostlyOstrich36

CostlyOstrich36 That seemed to do the job! No message after the first epoch, with the caveat of losing resource monitoring. Any idea what could be causing this? If the resource monitor is the first plot, will the iteration detection fail? Are there any hacks to keep the resource monitoring? Thanks a lot! 🙌

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, can you try disabling the CPU/GPU detection?

  				
Posted 3 years ago by CostlyOstrich36

CostlyOstrich36 PyTorch Lightning exposes current_epoch in the trainer, not sure if that is what you mean.

  				
Posted 3 years ago by GrievingTurkey78
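
One way to state iterations explicitly, as asked about in the thread, is to pass the trainer's current_epoch as the iteration when reporting a scalar through ClearML's Logger. The sketch below is an assumption about how that could look, not something confirmed in the thread: report_epoch_scalar is a hypothetical helper, and whether it removes the warning in this particular setup is untested.

```python
def report_epoch_scalar(logger, trainer, title, series, value):
    """Report one scalar to ClearML, using the Lightning epoch as the iteration.

    `logger` is expected to be a clearml.Logger and `trainer` a
    pytorch_lightning.Trainer (anything exposing `current_epoch` works).
    """
    iteration = int(trainer.current_epoch)
    logger.report_scalar(
        title=title,
        series=series,
        value=float(value),
        iteration=iteration,
    )
    return iteration


# Usage, assuming clearml and pytorch_lightning are installed:
#   from clearml import Logger
#   report_epoch_scalar(Logger.current_logger(), trainer, "loss", "train", loss)
```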

I set the number to a crazy value and it fails around the same iteration.

  				
Posted 3 years ago by GrievingTurkey78

GrievingTurkey78, I'm not sure. Let me check.
Do you have CPU/GPU tracking through both PyTorch Lightning AND ClearML reported in your task?

  				
Posted 3 years ago by CostlyOstrich36
