Hi Guys, Does Anybody Have The Same Issue Like Me? Is There Any Workaround?

Answered

Hi guys, has anybody else run into this issue? Is there a workaround? https://github.com/allegroai/clearml/issues/762

  				
Posted 2 years ago by VivaciousWalrus21


Answers 12

The question is: are there any workarounds to set the last iteration to the correct value, preferably in a simple way (i.e. without setting it manually)?

  				
Posted 2 years ago by VivaciousWalrus21

I tried it, but unfortunately it only sets the last iteration to 0 instead of using the last iteration from TensorBoard, and it simply overwrites the logs. The expected behaviour is that it reads the last iteration correctly. At least, that is what the docs state.

  				
Posted 2 years ago by VivaciousWalrus21

Thanks Martin. I reran everything from scratch using continue_last_task=0, and it looks like it helped a lot, but not completely. You can see in the attached screenshot that the gaps in the iteration axis are still a little bigger than expected. I've rerun it twice.

  				
Posted 2 years ago by VivaciousWalrus21

No, I don't need the last iteration set to zero. All I need is for ClearML to initialize it correctly from TensorBoard (or from wherever it initializes it). When I train a model, stop training, and then resume it, ClearML seems to double the last iteration (I guess) instead of using it. This can be seen in the screenshot attached to the GitHub issue.

  				
Posted 2 years ago by VivaciousWalrus21

VivaciousWalrus21 I took a look at your example from the GitHub issue:
https://github.com/allegroai/clearml/issues/762#issuecomment-1237353476
It seems to do exactly what you expect, and it stores its own last iteration as part of the checkpoint. When running the example with continue_last_task=int(0), you get exactly what you expect.
(Do notice that TB visualizes these graphs in a very odd way, and it took me a few clicks to verify it...)
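The pattern mentioned here, where the training script stores its own last iteration as part of the checkpoint, could look roughly like the sketch below. This is a standard-library-only illustration and not the code from the GitHub issue: the JSON checkpoint format and the names `save_checkpoint`, `load_checkpoint` and `train` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_step, state):
    """Store the next step to run together with the (toy) training state."""
    with open(path, "w") as f:
        json.dump({"next_step": next_step, "state": state}, f)


def load_checkpoint(path):
    """Return (next_step, state); start from step 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        data = json.load(f)
    return data["next_step"], data["state"]


def train(path, num_steps):
    """Run num_steps more steps, resuming from the checkpoint if present.

    Returns the list of step numbers the script would report to its logger.
    """
    start, state = load_checkpoint(path)
    reported = []
    for step in range(start, start + num_steps):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        reported.append(step)
        save_checkpoint(path, step + 1, state)
    return reported


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = train(ckpt, 3)   # fresh start
second_run = train(ckpt, 3)  # resumed from the checkpoint
print(first_run, second_run)
```

A script written this way already resumes its step counter on its own, so any additional offset applied by the logging layer would shift the reported iterations a second time.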

  				
Posted 2 years ago by AgitatedDove14

Oh sorry, from the docstring, this will work:
` :param bool continue_last_task: Continue the execution of a previously executed Task (experiment)

.. note::
    When continuing the executing of a previously executed Task,
    all previous artifacts / models/ logs are intact.
    New logs will continue iteration/step based on the previous-execution maximum iteration value.
    For example:
    The last train/loss scalar reported was iteration 100, the next report will be iteration 101.

The values are:

- ``True`` - Continue the last Task ID.
    specified explicitly by reuse_last_task_id or implicitly with the same logic as reuse_last_task_id
- ``False`` - Overwrite the execution of previous Task  (default).
- A string - You can also specify a Task ID (string) to be continued.
    This is equivalent to `continue_last_task=True` and `reuse_last_task_id=a_task_id_string`.
- An integer - Specify initial iteration offset (override the auto automatic last_iteration_offset)
    Pass 0, to disable the automatic last_iteration_offset or specify a different initial offset
    You can specify a Task ID to be used with `reuse_last_task_id='task_id_here'` `

Notice that we are setting the initial iteration offset manually at initialization time; this should do the trick:
task = Task.init(project_name='OCR/CRNN', task_type='training', task_name='CRNN from scratch', reuse_last_task_id=True, continue_last_task=int(0))
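The docstring above lists `False` and the integer `0` as separate options, even though `0 == False` in Python. The toy function below is not ClearML's implementation; `describe_continue_last_task` is a hypothetical helper that only illustrates how the documented value types can be told apart, which may be why the call above spells the argument as `int(0)`.

```python
def describe_continue_last_task(value):
    """Classify a continue_last_task value according to the docstring's list.

    Hypothetical helper for illustration only, not part of ClearML.
    """
    # bool is a subclass of int in Python, so it must be checked first;
    # otherwise False would be mistaken for the integer offset 0.
    if isinstance(value, bool):
        return "continue last task" if value else "overwrite previous task"
    if isinstance(value, str):
        return f"continue task {value}"
    if isinstance(value, int):
        return f"initial iteration offset {value}"
    raise TypeError(f"unsupported value: {value!r}")


print(describe_continue_last_task(False))
print(describe_continue_last_task(0))
print(describe_continue_last_task("a_task_id_string"))
```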

  				
Posted 2 years ago by AgitatedDove14

Hi VivaciousWalrus21

"After restarting training huge gaps appear in iteration axis (see the screenshot)."

Task.init tries to determine the last reported iteration and continue from it. I'm assuming your code does that as well, which creates a "double shift" that you see as the jump. I think the next version will try to be "smarter" about it and detect this double gap.
In the meantime, you can do:
task = Task.init(...)
task.set_initial_iteration(0)
wdyt?
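The "double shift" described here can be illustrated with a toy model. The sketch below does not call ClearML; it assumes, for illustration only, that the logger adds a fixed initial offset to every step the script reports. The function name `reported_iterations` and the numbers are hypothetical.

```python
def reported_iterations(script_steps, initial_iteration):
    """Toy model: the logger adds an initial offset to each step from the script."""
    return [initial_iteration + step for step in script_steps]


# Assume the previous run ended at iteration 100 and the resumed script
# restores its own counter from a checkpoint, so it reports 101, 102, 103.
previous_last_iteration = 100
resumed_steps = [101, 102, 103]

# The offset is inferred from the previous run AND the script also resumes
# its own counter: the two shifts add up and a gap appears.
double_shifted = reported_iterations(resumed_steps, previous_last_iteration)

# With the initial offset forced to 0, only the script's own counter is used.
corrected = reported_iterations(resumed_steps, 0)

print(double_shifted)  # [201, 202, 203]
print(corrected)       # [101, 102, 103]
```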

  				
Posted 2 years ago by AgitatedDove14

Hi Martin, thanks for the response! Nope, setting initial iteration didn’t solve the problem.

  				
Posted 2 years ago by VivaciousWalrus21

My pleasure 🙂

  				
Posted 2 years ago by AgitatedDove14

Thanks a lot for your help!

  				
Posted 2 years ago by VivaciousWalrus21

Hi VivaciousWalrus21, I tested the sample code, and the gap was evident in TensorBoard as well. It is not ClearML that generates this jump; it is internal to the code base (e.g. its automatic de/serialization and resume logic).

  				
Posted 2 years ago by AgitatedDove14

"Expected behaviour is that it reads last iteration correctly. At least it is stated in docs so."

This is exactly what should happen. Are you saying that for some reason it fails?

  				
Posted 2 years ago by AgitatedDove14

