not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to the values that are explicitly reported, no?
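One possible way to keep the resource monitoring plots continuing while still plotting metrics at the real epoch is to subtract the offset before reporting. This is only a sketch: it assumes the stored offset is simply added to every explicitly reported step, and `step_to_report` is a hypothetical helper, not part of ClearML. The offset itself would have to be read from the task (for example with `task.get_initial_iteration()`, if your ClearML version provides it).

```python
def step_to_report(epoch: int, initial_iteration: int) -> int:
    """Return the step to pass as global_step.

    Assumption: the reporting backend adds `initial_iteration` to every
    reported step, so subtracting it here makes the plotted value equal `epoch`.
    """
    return epoch - initial_iteration


# Hypothetical resumed run: the task previously stopped at iteration 50.
offset = 50
for epoch in (50, 51, 52):
    step = step_to_report(epoch, offset)
    # What the plot would show under the stated assumption:
    plotted = step + offset
    print(epoch, step, plotted)
```

With this approach the offset on the task stays untouched, so anything that relies on it keeps its own numbering.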
okay, so if there's no workaround atm, should I create a GitHub issue?
Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 😃
No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have: task = Task.init(....) followed by task.set_initial_iteration(0)?
it will probably screw up my resource monitoring plots, but well, who cares 😃
I think we should open a GitHub Issue and get some more feedback, maybe we should just add support on the backend side?
Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it, as it is stored on the Task itself
still no luck, I tried everything =( any updates?
perhaps I need to do task.set_initial_iteration(0)?
Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning a fix would require releasing a new server version, which usually takes longer)
That said maybe we can pass an argument to the "Task.init" so it ignores it? wdyt?
this would be great. I could just then pass it as a hyperparameter
DilapidatedDucks58 by default if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing
Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?
does this mean that setting initial iteration to 0 should help?
this is how it looks if I zoom in on the epochs that ran before the crash
thank you, I'll let you know if setting it to zero worked
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 😞
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass the number of epoch that my training is currently on...
Yes, so the idea is that it already "knows" where you stopped, so when you are reporting "iteration 1" it knows it's actually 0+previous_last_iteration
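The offset behaviour described above can be sketched in plain Python. This is a simplified model, assuming the stored offset is just added to every reported iteration; `shifted_iteration` is a hypothetical name for illustration, not a ClearML function, and the exact offset ClearML applies is not confirmed here.

```python
def shifted_iteration(reported_iteration: int, initial_iteration: int) -> int:
    """Model of the assumed behaviour: the stored offset is added to each reported step."""
    return initial_iteration + reported_iteration


# A task that stopped at iteration 50, resumed while the training code
# keeps reporting its real epoch numbers (50, 51, 52):
resumed = [shifted_iteration(epoch, 50) for epoch in (50, 51, 52)]

# The same reports if the offset were reset to 0:
reset = [shifted_iteration(epoch, 0) for epoch in (50, 51, 52)]

print(resumed)  # [100, 101, 102]
print(reset)    # [50, 51, 52]
```

Under this model, explicitly reported epochs end up shifted by the previous run's length, which would match the gap seen in the plots after a restart.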
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?