I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
this is how it looks if I zoom in on the epochs that ran before the crash
perhaps I need to do task.set_initial_iteration(0)?
DilapidatedDucks58 by default, if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing
does this mean that setting initial iteration to 0 should help?
Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
thank you, I'll let you know if setting it to zero worked
it will probably screw up my resource monitoring plots, but well, who cares 😃
Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it automatically, as it is stored on the Task itself
😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
maybe I should use explicit reporting instead of Tensorboard
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the number of the epoch that my training is currently on. it's kind of weird that it adds an offset to the values that are explicitly reported, no?
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 😞
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass the number of epoch that my training is currently on...
Yes, so the idea is that it already "knows" where you stopped, so when you are reporting "iteration 1" it knows it's actually 1+previous_last_iteration
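To make the offset concrete, here is a toy model of the behavior described above. It is only a sketch of the arithmetic, not ClearML's actual code: `stored_iteration` and the crash point of 40 epochs are made-up names and numbers for illustration.

```python
def stored_iteration(reported_step: int, initial_iteration: int) -> int:
    """Toy model: the iteration a continued task ends up plotting for a reported step."""
    return reported_step + initial_iteration


# Hypothetical: the first run crashed after epoch 40.
previous_last_iteration = 40

# Default when continuing: the offset equals the previous last iteration,
# so an explicitly passed global_step=41 is shifted by another 40.
resumed_default = [stored_iteration(e, previous_last_iteration) for e in (41, 42, 43)]

# With the offset forced to 0, explicitly passed epochs keep their own numbering.
resumed_zero = [stored_iteration(e, 0) for e in (41, 42, 43)]

print(resumed_default)  # [81, 82, 83]
print(resumed_zero)     # [41, 42, 43]
```

This is also why a zero offset is a trade-off: anything that counts from zero again after a restart (like the :monitor:gpu and :monitor:machine series mentioned above) would then overlap the earlier part of the plot instead of continuing after it.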
...
okay, so if there’s no workaround atm, should I create a GitHub issue?
still no luck, I tried everything =( any updates?
Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning a fix would require releasing a new server version, which usually takes longer)
That said maybe we can pass an argument to the "Task.init" so it ignores it? wdyt?
this would be great. I could just then pass it as a hyperparameter
I think we should open a GitHub issue and get some more feedback, maybe we should just add support on the backend side?
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 😃
sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)