I'm not sure this is exactly what I need, though; I wish to continue from a given checkpoint.
Also, will this overwrite graphs starting at a given step?
OddAlligator72 sure thing 🙂
This should sort it out: `Task.init('examples', 'train', continue_last_task=True)`
If you want to continue a specific Task: `continue_last_task='task_id_here'`
Getting the previous model: `last_checkpoint = task.models['output'][-1]`
What do you think?
Getting the last checkpoint can be done via:
Task.get_task(task_id='aabbcc').models['output'][-1]
Hi OddAlligator72
for instance - remove all the metrics from some step onward?
(I think that as long as the Task is not published, you could do such a thing directly with the REST API, aka APIClient from Python.)
What's the use case?
Hey AgitatedDove14 ,
I wish to be able to continue a previous run, but from a certain checkpoint onward (perhaps with changed data, perhaps with a different LR...). So I wish to be able to "go back" to the epoch of the checkpoint, and continue from there while retaining the entire history up to that point.
OddAlligator72 let's separate the two issues:
1. Continue reporting from a previous iteration
2. Retrieving a previously stored checkpoint
Now for the details:
Are you referring to a scenario where you execute your code manually (i.e. without the trains-agent)?
That's great for continuing from the last checkpoint, but, unless I misunderstand you, my intention is different:
Suppose I trained a model for 30k epochs overnight, and looking at the graphs, I wish to get back to the 22k'th epoch and retrain it from there differently, while preserving all the history up to that point.
So, I start by cloning the task, and... what can I do then to "get back" to the previous epoch? This means that I would like all metrics, logs, checkpoints, etc. from the 22k'th epoch forward deleted, and then to use your approach.
I see now.
Let's assume you know which snapshot that was:
` from trains import Task
prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second from last checkpoint
checkpoint_url = prev_task.models['output'][-2].url
# fetch the scalars reported by the previous run
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do some for loop and report the prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
# continue the iteration count from the chosen checkpoint
new_task.set_initial_iteration(22000)
# start the training `
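The "for loop" step above could look roughly like the sketch below. It assumes `get_reported_scalars()` returns a nested dict shaped like `{title: {series: {'x': [...], 'y': [...]}}}`, which is worth checking against your installed version. The helper names `truncate_scalars` and `replay_scalars` are made up for illustration and are not part of the library; the only library call used is `logger.report_scalar(title, series, value, iteration)`.

```python
def truncate_scalars(scalars, last_iteration):
    """Keep only the points whose iteration (x) is <= last_iteration.

    Assumes scalars looks like {title: {series: {'x': [...], 'y': [...]}}}.
    """
    trimmed = {}
    for title, series_map in scalars.items():
        for series, points in series_map.items():
            kept = [
                (x, y)
                for x, y in zip(points["x"], points["y"])
                if x <= last_iteration
            ]
            if kept:  # drop series that have no points before the cutoff
                trimmed.setdefault(title, {})[series] = {
                    "x": [x for x, _ in kept],
                    "y": [y for _, y in kept],
                }
    return trimmed


def replay_scalars(logger, scalars):
    """Re-report every point through logger.report_scalar.

    Returns the number of points that were sent.
    """
    count = 0
    for title, series_map in scalars.items():
        for series, points in series_map.items():
            for x, y in zip(points["x"], points["y"]):
                logger.report_scalar(
                    title=title, series=series, value=y, iteration=int(x)
                )
                count += 1
    return count
```

With the names from the snippet above, that would be used as `replay_scalars(logger, truncate_scalars(prev_scalars, 22000))`, followed by the `flush` and `set_initial_iteration(22000)` calls. Trimming before replaying means the new task only carries history up to the checkpoint you restart from.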
OK, that looks like a nice workaround. Thanks!
Manually should be the simplest, so let's start from there...
I see, is this what you are looking for?
https://allegro.ai/docs/task.html#trains.task.Task.init
continue_last_task='task_id'