Answered
I'm using TensorBoard SummaryWriter to add scalar metrics for the experiment. If the experiment crashed and I want to continue it from a checkpoint, for some reason it plots metrics in a really weird way, even though I pass global_step=epoch to the SummaryWriter.

I'm using TensorBoard SummaryWriter to add scalar metrics for the experiment. If the experiment crashed and I want to continue it from a checkpoint, for some reason it plots the metrics in a really weird way. Even though I pass global_step=epoch to the SummaryWriter (15 in this case), metric values for the new epoch are plotted somewhere around 98k instead of at 15. Anyone got ideas what I am doing wrong?
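
For context, this is roughly the reporting pattern being described, as a sketch only (the log_dir, tag name, and placeholder values are assumptions, not from the original post):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment")  # assumed log_dir

# Resuming from a checkpoint at epoch 15: the step passed here is the epoch number,
# yet the point shows up near iteration ~98k in the scalar plot instead of at 15.
epoch = 15
val_loss = 0.42  # placeholder value
writer.add_scalar("val/loss", val_loss, global_step=epoch)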

  
  
Posted 2 years ago

Answers 29


DilapidatedDucks58 by default, if you continue the execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing.

  
  
Posted 2 years ago

😞 DilapidatedDucks58 how exactly are you "relaunching/continuing" the execution? And what exactly are you setting?

  
  
Posted 2 years ago

maybe I should use explicit reporting instead of Tensorboard

It will do just the same 😞

there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?

Let me double check that...

overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine ...

That is a very good point

but for the metrics, I explicitly pass the number of epoch that my training is currently on...

Yes, so the idea is that it already "knows" where you stopped, so when you report "iteration 1" it knows it's actually 0 + previous_last_iteration ...
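
In other words, a hypothetical illustration of the offset being described, with made-up numbers (the assumption is that the previous run's last iteration was batch-level and therefore large):

# Assumed behaviour: the reported step is shifted by the previous run's last iteration.
previous_last_iteration = 98_000  # e.g. batch-level steps logged before the crash
reported_step = 15                # the global_step=epoch passed to the SummaryWriter
plotted_step = previous_last_iteration + reported_step  # lands near 98k, not at 15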

  
  
Posted 2 years ago

I think we should open a GitHub issue and get some more feedback; maybe we should just add support on the backend side?

  
  
Posted 2 years ago

Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞

  
  
Posted 2 years ago

okay, so if there’s no workaround atm, should I create a GitHub issue?

  
  
Posted 2 years ago

this would be great. I could just then pass it as a hyperparameter

  
  
Posted 2 years ago

nope, didn't work =(

  
  
Posted 2 years ago

Many thanks!

  
  
Posted 2 years ago

overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration

but for the metrics, I explicitly pass the number of the epoch that my training is currently on. It's kind of weird that it adds an offset to the values that are explicitly reported, no?

  
  
Posted 2 years ago

from clearml import Task

# Fetch the crashed task, update its parameters, and re-enqueue it
task = Task.get_task(task_id=args.task_id)
task.mark_started()  # reopen the stopped task so it can be modified
task.set_parameters_as_dict(
    {
        "General": {
            "checkpoint_file": model.url,
            "restart_optimizer": False,
        }
    }
)
# Try to reset the iteration counter so reporting starts from 0 again
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
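
On the worker side, the re-enqueued training script would presumably read these values back; a minimal sketch of that step, assuming the script uses Task.current_task() and mirrors the "General" section set above (note that parameter values may come back as strings):

from clearml import Task

task = Task.current_task()
params = task.get_parameters_as_dict()
checkpoint_file = params["General"]["checkpoint_file"]
restart_optimizer = params["General"]["restart_optimizer"]  # may be the string "False" rather than a bool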

  
  
Posted 2 years ago

Hi DilapidatedDucks58
apologies, this thread slipped away.
I double checked, the server will not allow you to overwrite it (meaning having it fixed will require releasing a server version, which usually takes longer).
That said, maybe we can pass an argument to "Task.init" so it ignores it? wdyt?

  
  
Posted 2 years ago

Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?

  
  
Posted 2 years ago

there is no method for setting last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?

  
  
Posted 2 years ago

it will probably screw up my resource monitoring plots, but well, who cares 😃

  
  
Posted 2 years ago

still no luck, I tried everything =( any updates?

  
  
Posted 2 years ago

not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
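
A minimal sketch of the two pieces being described, under the assumption that the resume script and the training loop look roughly like this (args.task_id, start_epoch, num_epochs, and validate() are placeholders):

from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Resume script: carry over the previous run's last iteration
task = Task.get_task(task_id=args.task_id)
task.set_initial_iteration(task.get_last_iteration())

# Training code: report per-epoch metrics with the epoch as the global step
writer = SummaryWriter()
for epoch in range(start_epoch, num_epochs):
    val_loss = validate()  # placeholder for the actual validation step
    writer.add_scalar("val/loss", val_loss, global_step=epoch)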

  
  
Posted 2 years ago

sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 😃

  
  
Posted 2 years ago

this is how it looks if I zoom in on the epochs that ran before the crash

  
  
Posted 2 years ago

thank you, I'll let you know if setting it to zero worked

  
  
Posted 2 years ago

maybe I should use explicit reporting instead of Tensorboard

  
  
Posted 2 years ago

perhaps I need to do task.set_initial_iteration(0)?

  
  
Posted 2 years ago

okay, I will open an issue

  
  
Posted 2 years ago

sorry that I keep bothering you, I love ClearML and try to promote it whenever I can, but this thing is a real pain in the ass 

No worries I totally feel you.
As a quick hack in the actual code of the Task itself, is it reasonable to have:
task = Task.init(....)
task.set_initial_iteration(0)
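
In context, the suggested hack would sit at the top of the training entry point, roughly like this sketch (the project and task names are placeholders):

from clearml import Task

# Reset the iteration offset right after init, so explicitly reported steps
# (e.g. global_step=epoch) are not shifted by the previous run's last iteration.
task = Task.init(project_name="my_project", task_name="resumed_experiment")
task.set_initial_iteration(0)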

  
  
Posted 2 years ago

Lol, :)
I think the issue is that you do not need to manually set the initial iteration, it's supposed to get it, as it is stored on the Task itself

  
  
Posted 2 years ago

Yep it should :)
I assume you add the previous iteration somewhere else, and this is the cause for the issue?

  
  
Posted 2 years ago

I use Docker for training, which means that log_dir contents are removed for the continued experiment btw

  
  
Posted 2 years ago

does this mean that setting initial iteration to 0 should help?

  
  
Posted 2 years ago