The question is: are there any workarounds to set the last iteration to the correct value? And preferably in a simple way (i.e. not setting it manually).
I tried it, but unfortunately this way it only sets the last iteration to 0 instead of using the last iteration from TensorBoard, and it simply rewrites the logs. The expected behaviour is that it reads the last iteration correctly; at least that is what the docs state.
Thanks Martin. I tried to rerun everything from scratch using continue_last_task=0
and it looks like it helped a lot, but not completely. You can see in the attached screenshot that the gaps in the iteration axis are still a little bigger than expected. I’ve rerun it two times.
No, I don’t need the last iteration set to zero. All I need is for ClearML to correctly initialize it from TensorBoard (or from wherever it initializes it). When I train a model, stop training, and then resume it, ClearML doubles (I guess) the last iteration instead of using it. This can be seen in the attached screenshot in the GitHub issue.
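Roughly what my training loop does when resuming, as a minimal sketch (not my actual code; `load_last_step_from_checkpoint` and `train_one_step` are placeholder helpers):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Resume the previously executed task so ClearML continues its logs
task = Task.init(project_name='OCR/CRNN', task_name='CRNN from scratch',
                 reuse_last_task_id=True, continue_last_task=True)
writer = SummaryWriter()

# The training code also restores its own global step from the checkpoint
start_step = load_last_step_from_checkpoint()   # e.g. 100 after the first run (placeholder)
for step in range(start_step, start_step + 100):
    loss = train_one_step()                     # placeholder
    writer.add_scalar('train/loss', loss, global_step=step)
```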
VivaciousWalrus21 I took a look at your example from the github issue:
https://github.com/allegroai/clearml/issues/762#issuecomment-1237353476
It seems to do exactly what you expect, and it stores its own last iteration as part of the checkpoint. When running the example with continue_last_task=int(0),
you get exactly what you expect.
(Do notice that TB visualizes these graphs in a very odd way, and it took me a few clicks to verify it...)
Oh sorry, from the docstring, this will work:
` :param bool continue_last_task: Continue the execution of a previously executed Task (experiment)
.. note::
    When continuing the execution of a previously executed Task,
    all previous artifacts / models / logs remain intact.
    New logs will continue iteration/step based on the previous-execution maximum iteration value.
    For example:
    if the last train/loss scalar reported was iteration 100, the next report will be iteration 101.
The values are:
- ``True`` - Continue the last Task ID,
  specified explicitly by reuse_last_task_id or implicitly with the same logic as reuse_last_task_id.
- ``False`` - Overwrite the execution of the previous Task (default).
- A string - You can also specify a Task ID (string) to be continued.
  This is equivalent to `continue_last_task=True` and `reuse_last_task_id=a_task_id_string`.
- An integer - Specify an initial iteration offset (overrides the automatic last_iteration_offset).
  Pass 0 to disable the automatic last_iteration_offset, or specify a different initial offset.
  You can specify a Task ID to be used with `reuse_last_task_id='task_id_here'`. `
Notice we are actually setting the last iteration manually at initialization time, so this should do the trick:
`task = Task.init(project_name='OCR/CRNN', task_type='training', task_name='CRNN from scratch', reuse_last_task_id=True, continue_last_task=int(0))`
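For completeness, the other value forms from the docstring would look roughly like this (a sketch; `'a_task_id_here'` is a placeholder Task ID):

```python
from clearml import Task

# Continue a specific previous Task by passing its ID as a string
task = Task.init(project_name='OCR/CRNN', task_type='training',
                 task_name='CRNN from scratch',
                 continue_last_task='a_task_id_here')

# Equivalent: continue_last_task=True plus an explicit reuse_last_task_id
task = Task.init(project_name='OCR/CRNN', task_type='training',
                 task_name='CRNN from scratch',
                 reuse_last_task_id='a_task_id_here', continue_last_task=True)
```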
Hi VivaciousWalrus21
After restarting training, huge gaps appear in the iteration axis (see the screenshot).
The `Task.init` call actually tries to understand what the last reported iteration was and continue from that iteration. I'm assuming your code does that as well, which creates the "double shift" you see as the jump. I think the next version will try to be "smarter" about it and detect this double gap.
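To illustrate the double shift with made-up numbers (just an illustration, not the actual internals):

```python
# Previous run stopped after reporting iteration 100
last_reported_iteration = 100

# ClearML continues reporting from the previous run's maximum iteration
clearml_offset = last_reported_iteration

# The training code also restores its own global step from the checkpoint
checkpoint_step = 100

# The first scalar reported after resuming lands around 200 instead of 101,
# which shows up as the gap in the iteration axis
first_resumed_iteration = clearml_offset + checkpoint_step
print(first_resumed_iteration)  # 200
```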
In the meantime, you can do:
`task = Task.init(...)`
`task.set_initial_iteration(0)`
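In context it would look roughly like this (a sketch reusing the same arguments as above):

```python
from clearml import Task

task = Task.init(project_name='OCR/CRNN', task_type='training',
                 task_name='CRNN from scratch',
                 reuse_last_task_id=True, continue_last_task=True)
task.set_initial_iteration(0)   # drop ClearML's automatic iteration offset
```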
wdyt?
Hi Martin, thanks for the response! Nope, setting initial iteration didn’t solve the problem.
Hi VivaciousWalrus21 I tested the sample code, and the gap was evident in TensorBoard as well. This is not ClearML generating the jump; it is internal to the code base (like the auto de/serialization and resuming of the code).
The expected behaviour is that it reads the last iteration correctly; at least that is what the docs state.
This is exactly what should happen; are you saying that for some reason it fails?