sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass
No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
I think we should open a GitHub issue and get some more feedback. Maybe we should just add support on the backend side?
this would be great. I could just then pass it as a hyperparameter
Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning a fix would require releasing a new server version, which usually takes longer)
That said maybe we can pass an argument to the "Task.init" so it ignores it? wdyt?
still no luck, I tried everything =( any updates?
okay, so if there’s no workaround atm, should I create a GitHub issue?
maybe I should use explicit reporting instead of Tensorboard
It will do just the same 😞
there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
Let me double check that...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...
That is a very good point
but for the metrics, I explicitly pass the number of epoch that my training is currently on...
Yes, so the idea is that it already "knows" where you stopped, so when you are reporting "iteration 1" it knows it's actually 0+previous_last_iteration
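To make the offset idea concrete, here is a toy model in plain Python. It is not ClearML code: `OffsetReporter` and `stored_iterations` are made-up names for illustration, and the previous last iteration of 10 is an assumed value. It only shows why steps that are already absolute end up shifted when an offset is added on top.

```python
class OffsetReporter:
    """Toy stand-in for a reporter that shifts every step by a fixed offset."""

    def __init__(self, initial_iteration=0):
        # Offset added to every reported step (hypothetical field name).
        self.initial_iteration = initial_iteration
        self.points = []  # list of (stored_iteration, value)

    def report(self, iteration, value):
        # The stored step is the reported step plus the offset.
        stored = self.initial_iteration + iteration
        self.points.append((stored, value))
        return stored


def stored_iterations(initial_iteration, reported_steps):
    """Return the steps that would be stored for the given reported steps."""
    reporter = OffsetReporter(initial_iteration)
    return [reporter.report(step, 0.0) for step in reported_steps]


# Resumed run that reports absolute epochs 10, 11, 12:
print(stored_iterations(0, [10, 11, 12]))   # [10, 11, 12]
print(stored_iterations(10, [10, 11, 12]))  # [20, 21, 22]
```

With relative steps (restarting from 0 after a resume) the offset is what you want; with absolute steps it leaves a gap in the plot.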
...
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the number of the epoch that my training is currently on. it's kind of weird that it adds an offset to the values that are explicitly reported, no?
Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?
Lol, :)
I think the issue is that you do not need to manually set the initial iteration; it's supposed to pick it up, as it is stored on the Task itself
it will probably screw up my resource monitoring plots, but well, who cares 😃
thank you, I'll let you know if setting it to zero worked
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
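For reference, the two options discussed here can be written as one small helper. This is only a sketch: `choose_initial_iteration` and `steps_are_absolute` are hypothetical names, and `task` is assumed to be any object exposing `get_last_iteration()` and `set_initial_iteration()` as used in this thread. Per the discussion, the value may not be kept when the task is relaunched through a queue, so treat this as an illustration of the intent rather than a confirmed fix.

```python
def choose_initial_iteration(task, steps_are_absolute):
    """Pick and apply the iteration offset for a resumed task.

    steps_are_absolute=True  -> training code passes the real epoch
                                (e.g. global_step=epoch), so no offset.
    steps_are_absolute=False -> training code restarts counting from 0,
                                so continue from the last iteration.
    """
    offset = 0 if steps_are_absolute else task.get_last_iteration()
    task.set_initial_iteration(offset)
    return offset
```

In the resume script this would replace the direct `task.set_initial_iteration(...)` call, with `steps_are_absolute=True` for training code that passes `global_step=epoch`.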
Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?
does this mean that setting initial iteration to 0 should help?
DilapidatedDucks58 by default, if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing
perhaps I need to do task.set_initial_iteration(0)?
this is how it looks if I zoom in on the epochs that ran before the crash
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw