Hi, I came across some inconsistency in the iteration reporting in ClearML with pytorch-lightning when calling trainer.fit multiple times. Before I dive in, I wondered if there is a known issue related to this?

Hi, I came across some inconsistency in the iteration reporting in ClearML with pytorch-lightning when calling trainer.fit multiple times. Before I dive in, I wondered if there is a known issue related to this?

  
  
Posted 3 years ago

Answers 15


Thanks RipeGoose2!

clearml logging starts from n+n (thats how it seems) for non explicit

I have to say, it looks like the expected behavior, I think.
Basically matching the TB, no?

  
  
Posted 3 years ago

and also in terms of outcome, the scalars follow the correct epoch count, but the debug samples and monitored performance metric show a different count

  
  
Posted 3 years ago

Hi RipeGoose2
Are you continuing the Task, i.e. passing Task.init(..., continue_last_task=True)?

  
  
Posted 3 years ago

Thanks!!

  
  
Posted 3 years ago

Hi AgitatedDove14, the task is initialized once, before the multiple training runs:
` Task.init(...)
trainer.fit(model)

# something

trainer.fit(model)
... `

  
  
Posted 3 years ago

AgitatedDove14, in terms of explicit reporting, I'm using current_epoch, which is correct when I check it in debug mode.
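As a rough illustration of what explicit reporting with the epoch as the iteration could look like: the sketch below assumes a logger object exposing ClearML's `report_scalar(title, series, value, iteration)`. The `report_epoch_metrics` helper and the `RecordingLogger` stand-in are hypothetical names made up here so the snippet runs without a ClearML server; they are not taken from the thread.

```python
from typing import Dict, List, Tuple


class RecordingLogger:
    """Hypothetical stand-in for a ClearML Logger; it only records the calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, float, int]] = []

    def report_scalar(self, title: str, series: str, value: float, iteration: int) -> None:
        self.calls.append((title, series, value, iteration))


def report_epoch_metrics(logger, metrics: Dict[str, float], current_epoch: int) -> None:
    """Report each metric with the epoch passed explicitly as the iteration."""
    for name, value in metrics.items():
        logger.report_scalar(
            title=name, series="val", value=float(value), iteration=current_epoch
        )


logger = RecordingLogger()
report_epoch_metrics(logger, {"loss": 0.42}, current_epoch=10)
print(logger.calls)
```

With a real task, the logger would presumably come from `Task.current_task().get_logger()` instead of the stand-in. Passing the iteration explicitly matches the observation in this thread that explicitly reported scalars keep the correct epoch count.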

  
  
Posted 3 years ago

I assume every fit starts reporting from step 0, so they override one another. Could it be?

  
  
Posted 3 years ago

AgitatedDove14 no, it has an offset equal to the value it started with. So, for example, if you stopped at n, then when you are running the n+1 epoch you get the 2*n+1 reported.
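The arithmetic behind that observation can be sketched as below. This is only a toy model of the symptom, assuming the second fit adds an offset equal to the epoch where the first fit stopped; `reported_iteration` is a hypothetical helper and says nothing about how ClearML computes iterations internally.

```python
def reported_iteration(actual_epoch: int, offset: int) -> int:
    """Iteration shown if an offset is added on top of the true epoch count."""
    return actual_epoch + offset


n = 10  # epoch at which the first trainer.fit() stopped

# First fit: no offset, so reported iterations match the real epochs 0..n.
first_fit = [reported_iteration(epoch, offset=0) for epoch in range(n + 1)]

# Second fit: the true epoch is n + 1, but an assumed offset of n is added again.
second_fit_first_epoch = reported_iteration(n + 1, offset=n)

print(first_fit[-1], second_fit_first_epoch)
```

Under this assumption, the first epoch of the second fit lands at 2*n+1 rather than n+1, which is the gap described above.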

  
  
Posted 3 years ago

Hi AgitatedDove14, so it looks something like this:
` Task.init(...)
trainer.fit(model)  # clearml logging starts from 0 and logs all summaries correctly according to the real count

# triggered fit stopping at epoch=n

# something

trainer.fit(model)  # clearml logging starts from n+n (that's how it seems) for non-explicit scalar summaries (debug samples, scalar resources monitoring, and also the global iteration count)

# triggered fit stopping

... `
I am moving away from this implementation to something else at the moment, so personally it wouldn't be an issue for me. I'm reporting it because it might be useful for someone in the future.

  
  
Posted 3 years ago

Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, regarding "calling trainer.fit multiple times": would you expect it to show as a single experiment, or is it a kind of param search?

  
  
Posted 3 years ago

so it sounds like there is no known issue related to this

  
  
Posted 3 years ago

AgitatedDove14 should be, I'll try to create a small example later today or tomorrow

  
  
Posted 3 years ago

but the debug samples and monitored performance metric show a different count

Hmm, could you expand on what you are getting, and what you are expecting to get?

  
  
Posted 3 years ago

AgitatedDove14 a single experiment that is being paused and resumed.
Inconsistency in the reporting: when resuming at the 10th epoch, for example, and doing an extra epoch, the ClearML iteration count is wrong for debug images and monitored metrics, but somehow not for the scalar reporting.

  
  
Posted 3 years ago

when you are running the n+1 epoch you get the 2*n+1 reported
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?

  
  
Posted 3 years ago