Hi OddAlligator72
for instance - remove all the metrics from some step onward?
(I think that as long as the Task is not published, you could do such a thing directly with the RestAPI, aka the APIClient from Python.)
What's the use case?
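For reference, roughly what I mean by the APIClient route (a sketch only; I'm assuming an events.delete_for_task endpoint, which as far as I know wipes all reported events for the Task rather than only those from a given step onward, and the task ID is a placeholder):
from trains.backend_api.session.client import APIClient

client = APIClient()
# assumption: this clears every reported metric/event for the (unpublished) Task;
# trimming only from a certain iteration onward would need more surgical use of the events service
client.events.delete_for_task(task='your_task_id_here')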
Hey AgitatedDove14 ,
I wish to be able to continue a previous run, but from a certain checkpoint onward (perhaps with changed data, perhaps with a different LR...). So I wish to be able to "go back" to the epoch of the checkpoint and continue from there, while retaining the entire history up to that point.
I see, is this what you are looking for?
https://allegro.ai/docs/task.html#trains.task.Task.init
continue_last_task='task_id'
Getting the last checkpoint can be done via:
Task.get_task(task_id='aabbcc').models['output'][-1]
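To make that concrete, a minimal sketch ('aabbcc' and the project/task names are placeholders; get_local_copy() downloads the stored weights file):
from trains import Task

# fetch the previous run and grab its latest registered checkpoint
prev_task = Task.get_task(task_id='aabbcc')
last_checkpoint = prev_task.models['output'][-1]
weights_path = last_checkpoint.get_local_copy()  # local path to the checkpoint file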
I'm not sure this is exactly it, though; I wish to continue from a given checkpoint.
Also, will this overwrite graphs starting at a given step?
OddAlligator72 let's separate the two issues:
1. Continue reporting from a previous iteration
2. Retrieving a previously stored checkpoint
Now for the details:
Are you referring to a scenario where you execute your code manually (i.e. without the trains-agent) ?
Manually should be the simplest, so let's start from there...
OddAlligator72 sure thing 🙂
This should sort it out:
Task.init('examples', 'train', continue_last_task=True)
If you want to continue a specific Task:
continue_last_task='task_id_here'
Getting the previous model:
last_checkpoint = task.models['output'][-1]
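Put together, a rough sketch of the resume flow (loading with PyTorch is just my assumption; swap in whatever framework you use):
import torch
from trains import Task

# resume reporting into the previous Task (or pass continue_last_task='task_id_here')
task = Task.init('examples', 'train', continue_last_task=True)

# restore the latest checkpoint registered on that Task
last_checkpoint = task.models['output'][-1]
state = torch.load(last_checkpoint.get_local_copy())
# model.load_state_dict(state)  # 'model' being whatever network you are training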
What do you think?
That's great for continuing from the last checkpoint, but, unless I misunderstand you, my intention is different:
Suppose I trained a model for 30k epochs overnight, and looking at the graphs, I wish to get back to the 22k'th epoch and retrain it from there differently, while preserving all the history up to that point.
So, I start by cloning the task, and... what can I do then to "get back" to the previous epoch? This means I would like all metrics, logs, checkpoints, etc. from the 22k'th epoch onward deleted, and then to use your approach.
I see now.
Let's assume you know which snapshot that was:
` prev_task = Task.get_task(task_id='the_first_training_task_id')
# get the second-from-last checkpoint
checkpoint_url = prev_task.models['output'][-2].url
prev_scalars = prev_task.get_reported_scalars()
new_task = Task.init('example', 'new task')
logger = new_task.get_logger()
# do a for loop and report prev_scalars with logger.report_scalar
new_task.flush(wait_for_uploads=True)
new_task.set_initial_iteration(22000)
# start the training `
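To flesh out the "for loop" step, a rough sketch, assuming get_reported_scalars() returns a nested dict of the form {title: {series: {'x': [iterations], 'y': [values]}}} and 22000 is your rollback point:
for title, series_dict in prev_scalars.items():
    for series, points in series_dict.items():
        for iteration, value in zip(points['x'], points['y']):
            if iteration <= 22000:  # replay only the history up to the rollback point
                logger.report_scalar(title=title, series=series, value=value, iteration=int(iteration))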
OK, that looks like a nice workaround. Thanks!