OddAlligator72 let's separate the two issues:
Continue reporting from a previous iteration Retrieving a previously stored checkpointNow for the details:
Are you referring to a scenario where you execute your code manually (i.e. without the trains-agent) ?
That's great for continuing from the last checkpoint, but, unless I misunderstand you, my intention is different:
Suppose I trained a model for 30k epochs over night, and looking at the graphs, I wish to get back to the 22k'th epoch and retrain it from there differently, while preserving all the history up to that point.
So, I start by cloning the task, and.. what can I do then to "get back" to the previous epoch? This means that I would like all metrics, logs, checkpoints, etc. from the 22k'th epoch forward deleted, and then to use your approach.
I see now.
Let's assume you know which snapshot that was:
` prev_task = Task.get_task(task_id='the_first_training_task_id')
get the second from last checkpoint
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
do some fpr loop and report the prev_scalars with logger.report_scalars
start the train `
OddAlligator72 sure thing 🙂
This should sort it out:
Task.init('examples', 'train', continue_last_task=True)If you want to continue a specific Task:
continue_last_task='task_id_here'Getting the previous model:
last_checkopoint = task.models['output'][-1]What do you think?
Hey AgitatedDove14 ,
I wish to be able to continue a previous run, but from a certain checkpoint onward (perhaps with changed data, perhaps with different LR...). So I wish to be able to be able to "go back" to the epoch of the checkpoint, and continue from there while retaining the entire history up to that point.
I see, is this what you are looking for?