AgitatedDove14
Yes, i use continue_last_task with reuse_last_task_id. The iteration number is the actual number of batches that were used, or the number of the epoch at which the training stopped. The iterations are served sequentially, but for some reason there is a gap in this picture
Hmm, I see the jump from 50 to 100, is that consistent with the last iteration on the aborted Task (before continuing )?
Hi SourOx12
How do you set the iteration when you continue the experiment? is it with Task.init
continue_last_task
?
SourOx12
Hmmm. So if last iteration was 75, the next iteration (after we continue) will be 150 ?
AgitatedDove14 This does not solve the problem unfortunately:( New exp: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/ec096e98ed5c4eccaf8047673023fc3e/output/execution
The image shows the eval log. The second column is val, the third column is step
So the thing is clearml
automatically detects the last iteration of the previous run, my assumption you also add it hence the double shift.
SourOx12 could that be it?
Hi SourOx12
I think that you do not actually need this one:step = step - cfg.start_epoch + 1
you can just dostep += 1
ClearML Will take care of the offset itself
SourOx12
Run this example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Once, then change line #26 to:task = Task.init(project_name="examples", task_name="scalar reporting", continue_last_task=True)
and run again,
I'm not sure I follow the example... Are you sure this experiment continued a previous run?
What was the last iteration on the previous run ?
Sorry to answer so late AgitatedDove14
I also thought so and tried this thing:
` !pip install clearml
import clearml
id_last_start = '873add629cf44cd5ab2ef383c94b1c'
clearml.Task.set_credentials(...)
if id_last_start != '':
task = clearml.Task.get_task(task_id=id_last_start,project_name='tests', task_name='patience: 5 factor:0.5')
task = clearml.Task.init(project_name='Exp with ROP',
task_name='patience: 2 factor:0.75',
continue_last_task=True,
reuse_last_task_id=id_last_start,
)
else:
task = clearml.Task.init(project_name='tests', task_name='patience: 2 factor:0.75')
cfg.task = task
cfg.my_writer =task.get_logger()
def to_logger(path,step,val,cfg):
folder = path[:path.find('/')]
file = path[path.find('/')+1:]
if step == cfg.epoch:
step = step - cfg.start_epoch + 1
print(step,val)
clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val)
elif step == cfg.step:
step = step - cfg.start_step + 1
clearml.Logger.current_logger().report_scalar(folder, file, iteration=step, value=val) `And after restarting, I get these breaks in Scalars: https://app.community.clear.ml/projects/2d68c58ff6f14403b51ff4c2d0b4f626/experiments/873add629cf44cd5ab2ef383c94b1c9b/output/execution