Answered
Scalar metrics have gaps after restoring an experiment from a checkpoint. How can I get rid of them?

Hi, I have a problem: after I restore an experiment from a checkpoint, my scalar metrics have gaps because my iterations do not start from zero. Is there a smart way to get rid of them?

  
  
Posted 3 years ago

Answers 18


Hi SourOx12
How do you set the iteration when you continue the experiment? Is it with `Task.init` and `continue_last_task`?

  
  
Posted 3 years ago

AgitatedDove14
Yes, I use `continue_last_task` with `reuse_last_task_id`. The iteration number is the actual number of batches that were used, or the number of the epoch at which the training stopped. The iterations are reported sequentially, but for some reason there is a gap in this picture.

  
  
Posted 3 years ago

Hmm, I see the jump from 50 to 100. Is that consistent with the last iteration on the aborted Task (before continuing)?

  
  
Posted 3 years ago

AgitatedDove14
The gap is always equal to the number of iterations completed before continuing training.

  
  
Posted 3 years ago

SourOx12
Hmmm. So if the last iteration was 75, the next iteration (after we continue) will be 150?

  
  
Posted 3 years ago

AgitatedDove14
Yes (if the value of the first iteration is 0).

  
  
Posted 3 years ago

Okay let me check....

  
  
Posted 3 years ago

So the thing is, ClearML automatically detects the last iteration of the previous run. My assumption is that you also add it, hence the double shift.
SourOx12 could that be it?
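
To make the suspected double shift concrete, here is a toy model in plain Python. It does not call ClearML; `plotted_iterations` is a hypothetical helper, and the assumption that the offset equals the last iteration (75, to match the numbers above) is only an illustration of the behaviour described in this thread, not ClearML's actual implementation.

```python
def plotted_iterations(reported_steps, initial_iteration):
    """Toy model: every reported step is shifted by the task's initial iteration offset."""
    return [initial_iteration + step for step in reported_steps]


last_iteration = 75  # last iteration of the aborted run (number taken from the thread)

# A training loop restored from a checkpoint keeps counting from where it stopped,
# so the steps it reports are already absolute.
resumed_steps = [last_iteration + i for i in range(3)]

# If the continued task also applies an offset equal to the last iteration,
# the absolute steps are shifted a second time and a gap appears in the plot.
double_shifted = plotted_iterations(resumed_steps, initial_iteration=last_iteration)

# With a zero offset the absolute steps are plotted unchanged.
no_offset = plotted_iterations(resumed_steps, initial_iteration=0)

print(double_shifted)
print(no_offset)
```

In this model the gap is exactly the number of iterations completed before the restart, which matches what SourOx12 reports.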

  
  
Posted 3 years ago

Sorry to answer so late AgitatedDove14
I also thought so and tried this thing:
```
!pip install clearml
import clearml

id_last_start = '873add629cf44cd5ab2ef383c94b1c'

clearml.Task.set_credentials(...)
if id_last_start != '':
    task = clearml.Task.get_task(task_id=id_last_start, project_name='tests', task_name='patience: 5 factor:0.5')
    task = clearml.Task.init(project_name='Exp with ROP',
                             task_name='patience: 2 factor:0.75',
                             continue_last_task=True,
                             reuse_last_task_id=id_last_start,
                             )
else:
    task = clearml.Task.init(project_name='tests', task_name='patience: 2 factor:0.75')
cfg.task = task
cfg.my_writer = task.get_logger()

def to_logger(path, step, val, cfg):
    folder = path[:path.find('/')]
    file = path[path.find('/')+1:]
    if step == cfg.epoch:
        step = step - cfg.start_epoch + 1
        print(step, val)
        clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val)
    elif step == cfg.step:
        step = step - cfg.start_step + 1
        clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val)
```
And after restarting, I get these breaks in Scalars: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/873add629cf44cd5ab2ef383c94b1c9b/output/execution
  
  
Posted 3 years ago

Hi SourOx12
I think that you do not actually need this one:
`step = step - cfg.start_epoch + 1`
You can just do:
`step += 1`
ClearML will take care of the offset itself.

  
  
Posted 3 years ago

AgitatedDove14 This does not solve the problem, unfortunately :( New exp: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/ec096e98ed5c4eccaf8047673023fc3e/output/execution
The image shows the eval log. The second column is val, the third column is step

  
  
Posted 3 years ago

I'm not sure I follow the example... Are you sure this experiment continued a previous run?
What was the last iteration on the previous run ?

  
  
Posted 3 years ago

AgitatedDove14
The last iteration before the restore was 2. Everything from the 3rd iteration onward comes from the restored model.

  
  
Posted 3 years ago

AgitatedDove14
Could you please share a code example that restores training? I haven't found any. I would be very grateful.

  
  
Posted 3 years ago

Give me a minute, I'll check something

  
  
Posted 3 years ago

SourOx12
Run this example once:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Then change line #26 to:
`task = Task.init(project_name="examples", task_name="scalar reporting", continue_last_task=True)`
and run again.

  
  
Posted 3 years ago

Hi AgitatedDove14, I finally found a solution to the problem. I needed to call `task.set_initial_iteration(0)` after restoring the task. Thank you for your help!
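
For anyone landing here later, a minimal sketch of how that fix might look, assuming the training loop reports absolute step numbers after the restore. `absolute_steps` and `resume_and_report` are hypothetical helper names, and the `"loss"` / `"train"` titles are placeholders; only `Task.init(..., continue_last_task=True)`, `task.set_initial_iteration(0)`, `task.get_logger()` and `report_scalar` come from this thread. The ClearML import sits inside the function because it needs an installed and configured `clearml`.

```python
def absolute_steps(last_step, n_new):
    """Step numbers for a resumed run that keeps counting after the checkpoint."""
    return list(range(last_step + 1, last_step + 1 + n_new))


def resume_and_report(project_name, task_name, last_step, values):
    """Continue the previous task and report scalars at absolute step numbers."""
    # Imported here so the helper above can be used without ClearML installed.
    from clearml import Task

    task = Task.init(
        project_name=project_name,
        task_name=task_name,
        continue_last_task=True,
    )
    # Reset the offset so the absolute steps below are not shifted a second time.
    task.set_initial_iteration(0)

    logger = task.get_logger()
    for step, value in zip(absolute_steps(last_step, len(values)), values):
        logger.report_scalar(title="loss", series="train", iteration=step, value=value)
    return task
```

With the numbers mentioned above (last iteration 2 before the restore), `absolute_steps(2, 3)` yields steps 3, 4 and 5, so the restored run would continue the curve without a gap. If the loop instead reports steps relative to the restart, the offset ClearML detects on its own is probably what you want, and the reset should be left out.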

  
  
Posted 3 years ago

Nice SourOx12 !

  
  
Posted 3 years ago