Hi, I have a problem: after I restore an experiment from a checkpoint, my scalar metrics have gaps because my iterations do not start from zero. Is there a smart way to get rid of this?

  				
Posted 3 years ago by SourOx12

Answers 18

Hi SourOx12
How do you set the iteration when you continue the experiment? Is it with `continue_last_task` in `Task.init`?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14
Yes, I use `continue_last_task` with `reuse_last_task_id`. The iteration number is either the actual number of batches processed or the number of the epoch at which training stopped. The iterations are reported sequentially, but for some reason there is a gap in this picture.

  				
Posted 3 years ago by SourOx12

Hmm, I see the jump from 50 to 100. Is that consistent with the last iteration on the aborted Task (before continuing)?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14
The gap is always equal to the number of iterations completed before continuing training

  				
Posted 3 years ago by SourOx12

SourOx12
Hmmm. So if the last iteration was 75, the next iteration (after we continue) will be 150?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14
Yes (if the value of the first iteration is 0)

  				
Posted 3 years ago by SourOx12

Okay let me check....

  				
Posted 3 years ago by AgitatedDove14

So the thing is, ClearML automatically detects the last iteration of the previous run. My assumption is that you also add it yourself, hence the double shift.
SourOx12, could that be it?
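To make the double shift concrete, here is a minimal arithmetic sketch in plain Python (no ClearML calls). It assumes the displayed iteration is the step passed to the logger plus the task's initial iteration offset, and that the offset equals the last logged iteration, which matches the 75 to 150 numbers in this thread; the exact offset ClearML applies may differ by one. The function and variable names are hypothetical.

```python
def displayed_iteration(reported_step: int, initial_offset: int) -> int:
    """Iteration shown in the plot: the step passed to the logger plus the task offset."""
    return reported_step + initial_offset


last_iteration = 75       # last iteration logged before the run was aborted
resumed_global_step = 75  # hypothetical: the checkpoint restores the global step counter

# Continued task with the automatic offset, while the script also reports
# the global step: the previous run's length is added twice.
double_shifted = displayed_iteration(resumed_global_step, initial_offset=last_iteration)

# Same report with the offset set to 0: the curve continues where it stopped.
contiguous = displayed_iteration(resumed_global_step, initial_offset=0)

print(double_shifted, contiguous)  # 150 75
```

Under these assumptions the gap always equals the number of iterations completed before the restart, which is what was observed above.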

  				
Posted 3 years ago by AgitatedDove14

Sorry for the late reply, AgitatedDove14.
I thought so too, and tried this:

```python
!pip install clearml
import clearml

id_last_start = '873add629cf44cd5ab2ef383c94b1c'

clearml.Task.set_credentials(...)
if id_last_start != '':
    # Continue the previous task
    task = clearml.Task.get_task(task_id=id_last_start, project_name='tests', task_name='patience: 5 factor:0.5')
    task = clearml.Task.init(project_name='Exp with ROP',
                             task_name='patience: 2 factor:0.75',
                             continue_last_task=True,
                             reuse_last_task_id=id_last_start,
                             )
else:
    # Start a fresh task
    task = clearml.Task.init(project_name='tests', task_name='patience: 2 factor:0.75')
cfg.task = task
cfg.my_writer = task.get_logger()

def to_logger(path, step, val, cfg):
    # path has the form "<folder>/<file>": folder is the plot title, file is the series
    folder = path[:path.find('/')]
    file = path[path.find('/')+1:]
    if step == cfg.epoch:
        # Subtract the restored starting epoch manually
        step = step - cfg.start_epoch + 1
        print(step, val)
        clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val)
    elif step == cfg.step:
        # Subtract the restored starting step manually
        step = step - cfg.start_step + 1
        clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val)
```

And after restarting, I get these breaks in Scalars: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/873add629cf44cd5ab2ef383c94b1c9b/output/execution

  				
Posted 3 years ago by SourOx12

Hi SourOx12
I think you do not actually need this line:
`step = step - cfg.start_epoch + 1`
You can just do:
`step += 1`
ClearML will take care of the offset itself.

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 Unfortunately, this does not solve the problem :( New experiment: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/ec096e98ed5c4eccaf8047673023fc3e/output/execution
The image shows the eval log. The second column is val, the third column is step.

  				
Posted 3 years ago by SourOx12

I'm not sure I follow the example... Are you sure this experiment continued a previous run?
What was the last iteration of the previous run?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14
The last iteration before the restore was 2. Starting from the 3rd iteration, this is the restored model

  				
Posted 3 years ago by SourOx12

AgitatedDove14
Could you please share a code example that restores training? I haven't found any. I would be very grateful.

  				
Posted 3 years ago by SourOx12

Give me a minute, I'll check something

  				
Posted 3 years ago by AgitatedDove14

SourOx12
Run this example once:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Then change line #26 to:
`task = Task.init(project_name="examples", task_name="scalar reporting", continue_last_task=True)`
and run it again.

  				
Posted 3 years ago by AgitatedDove14

Hi AgitatedDove14, I finally found a solution to the problem. I should have called `task.set_initial_iteration(0)` after restoring the task. Thank you for your help!
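For anyone landing here later, a minimal sketch of that fix might look like the following. It assumes the training script restores its own global step from the checkpoint and reports it unchanged; the helper names `resume_without_offset` and `log_scalar` and their arguments are hypothetical, while `Task.init`, `set_initial_iteration`, `get_logger` and `report_scalar` are the ClearML calls already used in this thread.

```python
def resume_without_offset(project_name: str, task_name: str, task_id: str):
    """Continue an existing task and turn off the automatic iteration offset."""
    # Imported lazily so the sketch can be loaded without ClearML configured.
    from clearml import Task

    task = Task.init(
        project_name=project_name,
        task_name=task_name,
        continue_last_task=True,
        reuse_last_task_id=task_id,
    )
    # The fix from this thread: reset the offset, because the restored
    # step counter already continues from the checkpoint.
    task.set_initial_iteration(0)
    return task


def log_scalar(task, title: str, series: str, value: float, global_step: int) -> None:
    """Report one scalar at the restored global step, with no manual subtraction."""
    task.get_logger().report_scalar(
        title=title, series=series, value=value, iteration=global_step
    )
```

The alternative, as suggested earlier in the thread, is to leave the automatic offset in place and report steps counted from the start of the resumed run; the point is to apply the shift in only one place.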

  				
Posted 3 years ago by SourOx12

Nice SourOx12 !

  				
Posted 3 years ago by AgitatedDove14
