Answered
I'm using TensorBoard SummaryWriter to add scalar metrics for the experiment. If the experiment crashed and I want to continue it from a checkpoint, for some reason it plots metrics in a really weird way, even though I pass global_step=epoch to the SummaryWriter.

I'm using TensorBoard SummaryWriter to add scalar metrics for the experiment. If the experiment crashed and I want to continue it from a checkpoint, for some reason it plots the metrics in a really weird way. Even though I pass global_step=epoch to the SummaryWriter (15 in this case), metric values for the new epoch are plotted somewhere around 98k instead of at 15. Anyone got ideas what I am doing wrong?
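
For context, this is roughly the reporting pattern being described, as a sketch only (the log_dir, tag name, and placeholder values are assumptions, not from the original post):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment")  # assumed log_dir

# Resuming from a checkpoint at epoch 15: the step passed here is the epoch number,
# yet the point shows up near iteration ~98k in the scalar plot instead of at 15.
epoch = 15
val_loss = 0.42  # placeholder value
writer.add_scalar("val/loss", val_loss, global_step=epoch)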

  
  
Posted 2 years ago

Answers 29


DilapidatedDucks58 by default, if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing.

  
  
Posted 2 years ago

😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?

  
  
Posted 2 years ago

maybe I should use explicit reporting instead of Tensorboard

It will do just the same 😞

there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?

Let me double check that...

overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...

That is a very good point

but for the metrics, I explicitly pass the number of epoch that my training is currently on...

Yes, so the idea is that it already "knows" where you stopped, so when you report "iteration 1" it knows it's actually 0 + previous_last_iteration ...
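
In other words, a hypothetical illustration of the offset being described, with made-up numbers (the assumption is that the previous run's last iteration was batch-level and therefore large):

# Assumed behaviour: the reported step is shifted by the previous run's last iteration.
previous_last_iteration = 98_000  # e.g. batch-level steps logged before the crash
reported_step = 15                # the global_step=epoch passed to the SummaryWriter
plotted_step = previous_last_iteration + reported_step  # lands near 98k, not at 15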

  
  
Posted 2 years ago

I think we should open a GitHub issue and get some more feedback; maybe we should just add support on the backend side?

  
  
Posted 2 years ago

Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞

  
  
Posted 2 years ago

okay, so if there’s no workaround atm, should I create a GitHub issue?

  
  
Posted 2 years ago

this would be great. I could just then pass it as a hyperparameter

  
  
Posted 2 years ago

nope, didn't work =(

  
  
Posted 2 years ago

Many thanks!

  
  
Posted 2 years ago

overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration

but for the metrics, I explicitly pass the number of the epoch that my training is currently on. It's kind of weird that it adds an offset to the values that are explicitly reported, no?

  
  
Posted 2 years ago

from clearml import Task

# Fetch the crashed task, update its parameters, and re-enqueue it
task = Task.get_task(task_id=args.task_id)
task.mark_started()  # reopen the stopped task so it can be modified
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
# Try to reset the iteration counter so reporting starts from 0 again
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
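
On the worker side, the re-enqueued training script would presumably read these values back; a minimal sketch of that step, assuming the script uses Task.current_task() and mirrors the "General" section set above (note that parameter values may come back as strings):

from clearml import Task

task = Task.current_task()
params = task.get_parameters_as_dict()
checkpoint_file = params["General"]["checkpoint_file"]
restart_optimizer = params["General"]["restart_optimizer"]  # may be the string "False" rather than a bool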

  
  
Posted 2 years ago

Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning having it fixed will require releasing a server version, which usually takes longer).
That said, maybe we can pass an argument to "Task.init" so it ignores it? wdyt?

  
  
Posted 2 years ago

Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?

  
  
Posted 2 years ago

there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?

  
  
Posted 2 years ago

it will probably screw up my resource monitoring plots, but well, who cares 😃

  
  
Posted 2 years ago

still no luck, I tried everything =( any updates?

  
  
Posted 2 years ago

not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
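
A minimal sketch of the two pieces being described, under the assumption that the resume script and the training loop look roughly like this (args.task_id, start_epoch, num_epochs, and validate() are placeholders):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Resume script: carry over the previous run's last iteration
task = Task.get_task(task_id=args.task_id)
task.set_initial_iteration(task.get_last_iteration())

# Training code: report per-epoch metrics with the epoch as the global step
writer = SummaryWriter()
for epoch in range(start_epoch, num_epochs):
    val_loss = validate()  # placeholder for the actual validation step
    writer.add_scalar("val/loss", val_loss, global_step=epoch)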

  
  
Posted 2 years ago

sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 😃

  
  
Posted 2 years ago

this is how it looks if I zoom in on the epochs that ran before the crash

  
  
Posted 2 years ago

thank you, I'll let you know if setting it to zero worked

  
  
Posted 2 years ago

maybe I should use explicit reporting instead of Tensorboard

  
  
Posted 2 years ago

perhaps I need to do task.set_initial_iteration(0)?

  
  
Posted 2 years ago

okay, I will open an issue

  
  
Posted 2 years ago

sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 

No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)
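
In context, the suggested hack would sit at the top of the training entry point, roughly like this sketch (the project and task names are placeholders):

from clearml import Task

# Reset the iteration offset right after init, so explicitly reported steps
# (e.g. global_step=epoch) are not shifted by the previous run's last iteration.
task = Task.init(project_name="my_project", task_name="resumed_experiment")
task.set_initial_iteration(0)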

  
  
Posted 2 years ago

Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it, as it is stored on the Task itself

  
  
Posted 2 years ago

Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?

  
  
Posted 2 years ago

I use Docker for training, which means that log_dir contents are removed for the continued experiment btw

  
  
Posted 2 years ago

does this mean that setting initial iteration to 0 should help?

  
  
Posted 2 years ago