Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, regarding "calling trainer.fit multiple times": would you expect it to show as a single experiment, or is it a kind of param search?
AgitatedDove14 a single experiment, that is being paused and resumed.
inconsistency in the reporting: when resuming at the 10th epoch for example and doing an extra epoch, the clearml iteration count is wrong for debug images and monitored metrics.. somehow not for the scalar reporting
so it sounds like there is no known issue related to this
Hi RipeGoose2
Are you continuing the Task, i.e. passing Task.init(..., continue_last_task=True)?
Hi AgitatedDove14 , the initialization of the task happens once, before the multiple trainings:
` Task.init
trainer.fit(model)
something
trainer.fit(model)
... `
I assume every fit starts reporting from step 0, so they override one another. Could it be?
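A quick way to see what that hypothesis would imply (a minimal sketch with no ClearML involved; `report_run` and the dict-based store are hypothetical stand-ins for a logger that keys scalar reports by iteration number):

```python
# Sketch of the "every fit restarts at step 0" hypothesis: reports keyed
# by iteration number would overwrite one another across fit() calls.

def report_run(store, values, start_step=0):
    """Simulate a logger that keys each reported value by its iteration."""
    for i, v in enumerate(values):
        store[start_step + i] = v  # same key -> a later run overwrites an earlier one

store = {}
report_run(store, [0.9, 0.8, 0.7])        # first fit: steps 0..2
report_run(store, [0.6, 0.5, 0.4])        # second fit also starts at step 0
assert store == {0: 0.6, 1: 0.5, 2: 0.4}  # first run's points are gone

# With a correct continuation offset, both runs are preserved:
store2 = {}
report_run(store2, [0.9, 0.8, 0.7])
report_run(store2, [0.6, 0.5, 0.4], start_step=3)
assert len(store2) == 6
```

If the overwriting behavior above matched what you see, that would point at the step counter resetting between fit() calls rather than at the reporting itself.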
AgitatedDove14 in terms of explicit reporting I'm using current_epoch, which is correct when I check it in debug mode
and also in terms of outcome, the scalars follow the correct epoch count, but the debug samples and monitored performance metrics show a different count
Hmm, could you expand on what you are getting, and what you were expecting to get?
Hi AgitatedDove14 , so it looks something like this:
` Task.init
trainer.fit(model) # clearml logging starts from 0 and logs all summaries correctly according to real count
triggered fit stopping at epoch=n
something
trainer.fit(model) # clearml logging starts from n+n (that's how it seems) for non-explicit scalar summaries (debug samples, scalar resource monitoring, and also the global iteration count)
triggered fit stopping
... `
I am at the moment diverging from this implementation to something else, so personally it wouldn't be an issue for me. I'm reporting it because it might be useful for someone in the future.
Thanks RipeGoose2 !
clearml logging starts from n+n (that's how it seems) for non-explicit
I have to say it looks like the expected behavior, I think.
Basically matching the TB, no?
AgitatedDove14 no, it has an offset of the value it started with: for example, you stopped at n, then when you are running the n+1 epoch you get 2*n+1 reported
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?
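For reference, the arithmetic being described would come out of a double-counted resume offset (a hedged guess at the mechanism, not ClearML's actual implementation; `reported_iteration` is a hypothetical helper):

```python
# Suspected double-offset: the trainer resumes its own epoch counter at n+1,
# while the logger independently adds the last seen iteration n as a resume
# offset, so epoch n+1 is reported as 2*n+1.

def reported_iteration(last_epoch, current_epoch):
    resume_offset = last_epoch            # offset the logger adds on resume
    return resume_offset + current_epoch  # trainer already counts from n+1

n = 10
assert reported_iteration(n, n + 1) == 2 * n + 1  # epoch 11 would show as 21
```

Under this assumption the gap grows with every pause/resume cycle, which would match "twice the gap" after a single resume.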
AgitatedDove14 should be, I'll try to create a small example later today or tomorrow