Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, regarding "calling trainer.fit multiple times": would you expect it to show as a single experiment, or is it a kind of param search?
AgitatedDove14 a single experiment, that is being paused and resumed.
inconsistency in the reporting: when resuming at the 10th epoch for example and doing an extra epoch, the clearml iteration count is wrong for debug images and monitored metrics.. somehow not for the scalar reporting
so it sounds like there is no known issue related to this
Hi RipeGoose2
Are you continuing the Task, i.e. passing Task.init(..., continue_last_task=True)?
Hi AgitatedDove14 , the initialization of the task happens once, before the multiple trainings:
` Task.init
trainer.fit(model)
something
trainer.fit(model)
... `
I assume every fit starts reporting from step 0, so they override one another. Could it be?
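A quick way to see what that hypothesis would imply (a minimal sketch with no ClearML involved; `report_run` and the dict-based store are hypothetical stand-ins for a logger that keys scalar reports by iteration number):

```python
# Sketch of the "every fit restarts at step 0" hypothesis: reports keyed
# by iteration number would overwrite one another across fit() calls.

def report_run(store, values, start_step=0):
    """Simulate a logger that keys each reported value by its iteration."""
    for i, v in enumerate(values):
        store[start_step + i] = v  # same key -> a later run overwrites an earlier one

store = {}
report_run(store, [0.9, 0.8, 0.7])        # first fit: steps 0..2
report_run(store, [0.6, 0.5, 0.4])        # second fit also starts at step 0
assert store == {0: 0.6, 1: 0.5, 2: 0.4}  # first run's points are gone

# With a correct continuation offset, both runs are preserved:
store2 = {}
report_run(store2, [0.9, 0.8, 0.7])
report_run(store2, [0.6, 0.5, 0.4], start_step=3)
assert len(store2) == 6
```

If the overwriting behavior above matched what you see, that would point at the step counter resetting between fit() calls rather than at the reporting itself.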
AgitatedDove14 in terms of explicit reporting I'm using current_epoch, which is correct when I check it in debug mode
and also in terms of outcome, the scalars follow the correct epoch count, but the debug samples and monitored performance metrics show a different count
Hmm, could you expand on what you are getting, and what you were expecting to get?
Hi AgitatedDove14 , so it looks something like this:
` Task.init
trainer.fit(model) # clearml logging starts from 0 and logs all summaries correctly according to real count
triggered fit stopping at epoch=n
something
trainer.fit(model) # clearml logging starts from n+n (that's how it seems) for non-explicit scalar summaries (debug samples, scalar resource monitoring, and also the global iteration count)
triggered fit stopping
... `
I am at the moment diverging from this implementation to something else, so personally it wouldn't be an issue for me. I'm reporting it because it might be useful for someone in the future.
Thanks RipeGoose2 !
clearml logging starts from n+n (that's how it seems) for non-explicit
I have to say it looks like the expected behavior, I think.
Basically matching the TB, no?
AgitatedDove14 no, it has an offset of the value it started with: for example, you stopped at n, then when you are running the n+1 epoch you get 2*n+1 reported
RipeGoose2 like twice the gap, i.e. internally it adds an offset of the last iteration... is this easily reproducible?
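For reference, the arithmetic being described would come out of a double-counted resume offset (a hedged guess at the mechanism, not ClearML's actual implementation; `reported_iteration` is a hypothetical helper):

```python
# Suspected double-offset: the trainer resumes its own epoch counter at n+1,
# while the logger independently adds the last seen iteration n as a resume
# offset, so epoch n+1 is reported as 2*n+1.

def reported_iteration(last_epoch, current_epoch):
    resume_offset = last_epoch            # offset the logger adds on resume
    return resume_offset + current_epoch  # trainer already counts from n+1

n = 10
assert reported_iteration(n, n + 1) == 2 * n + 1  # epoch 11 would show as 21
```

Under this assumption the gap grows with every pause/resume cycle, which would match "twice the gap" after a single resume.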
AgitatedDove14 should be, I'll try to create a small example later today or tomorrow