We have tried to manually restart tasks by reloading all the scalars from a dead task and loading the latest saved torch model.
Hi ThickKitten19
How did you try to restart them? How are you monitoring dying instances? Where and how are they running?
Hi AgitatedDove14, I get the reported scalars from the web using model_task = Task.get_task(task_id=model_task_id) followed by scalars = model_task.get_reported_scalars()
then register each of the scalars with something like logger.report_scalar(title=metric_key, series=series_val['name'], value=y, iteration=x)
The new task then has the previously reported scalars, and I am able to append the rest of the model training reports to them.
Workers are running across multiple machines and you can monitor if a task is dead by looking at the web page.
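The manual scalar copy described above could be sketched roughly as follows. This is only an illustration: it assumes get_reported_scalars() returns a nested dict of the form {title: {series: {"name": ..., "x": [...], "y": [...]}}}, which matches the series_val['name'], x and y names used above. The function name replay_scalars and the project/task names are made up for the example.

```python
def replay_scalars(scalars, logger):
    """Re-report every (x, y) point in `scalars` through `logger`.

    `scalars` is assumed to look like
    {title: {series: {"name": str, "x": [...], "y": [...]}}}.
    Returns the number of points reported.
    """
    count = 0
    for metric_key, series_map in scalars.items():
        for series_val in series_map.values():
            for x, y in zip(series_val["x"], series_val["y"]):
                logger.report_scalar(
                    title=metric_key,
                    series=series_val["name"],
                    value=y,
                    iteration=int(x),
                )
                count += 1
    return count


def copy_scalars_to_new_task(model_task_id):
    # Imported lazily so the helper above can be used without clearml installed.
    from clearml import Task

    old_task = Task.get_task(task_id=model_task_id)
    # Hypothetical project/task names, for illustration only.
    new_task = Task.init(project_name="examples", task_name="resumed training")
    return replay_scalars(old_task.get_reported_scalars(), new_task.get_logger())
```

Keeping the loop separate from the ClearML calls makes it easy to check against a stand-in logger before pointing it at a real task.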
how did you try to restart them ?
Yes, but how did you restart the agent on the remote machine ?
AgitatedDove14 I am not restarting the agent itself, I just need to be able to continue the experiment from the same progress point. It can be a different agent. In fact, I am just loading the progress into another agent from the available queue.
It can be a different agent.
If inside a docker, then clearml-agent execute --id <task_id here> --docker
If you need venv, do clearml-agent execute --id <task_id here>
You can run that on any machine and it will respin and continue your Task
(obviously your code needs to be aware of that and be able to pull its own last model checkpoint from the Task artifacts / models)
Is this what you are after?
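If the relaunch is scripted rather than typed by hand, the two commands above could be wrapped along these lines. This is a minimal sketch: the helper names are hypothetical, and only the flags quoted in the thread (--id and --docker) are used.

```python
import shlex
import subprocess


def build_execute_command(task_id, use_docker=False):
    """Build the argv for `clearml-agent execute --id <task_id>` (optionally with --docker)."""
    cmd = ["clearml-agent", "execute", "--id", task_id]
    if use_docker:
        cmd.append("--docker")
    return cmd


def relaunch(task_id, use_docker=False, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = build_execute_command(task_id, use_docker=use_docker)
    print(shlex.join(cmd))
    if not dry_run:
        # Requires clearml-agent to be installed and configured on this machine.
        subprocess.run(cmd, check=True)
    return cmd
```

The dry-run default is a deliberate choice here, so the command can be inspected before anything is actually executed.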
AgitatedDove14 Let me clarify, I think you have misunderstood me.
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
However, the importance of the experiment is low, so when other, more important experiments appear, we need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.
Later, when the more important experiments have been completed, we can continue the HPO task from the same state.
Hope this makes the problem more clear.
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
We need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when the more important experiments have been completed, we can continue the HPO task from the same state.
Quick question: when you say the HPO Task, do you mean the HPO controller logic Task (i.e. the one launching the training jobs), or do you mean the actual training job itself (i.e. running with a specific set of parameters decided by the HPO controlling task)?
AgitatedDove14 Sorry, my bad! By HPO task I mean the actual training job itself.
We run the HPO controller logic Task on a separate CPU-only machine, so we can assume that this task is always on. Only the training jobs can go offline (for the reasons mentioned above).
Okay, that makes sense. If this is the case, I would just use clearml-agent execute --id <task_id here> to continue the training Task.
Do notice you have to reload your last checkpoint from the Task's models/artifacts to continue 🙂
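A rough sketch of that checkpoint reload, for illustration only. It assumes the training code saved its checkpoints as the Task's output models, that the last entry in task.models["output"] is the most recent one, and that each checkpoint is a plain state_dict written with torch.save. The helper names are hypothetical.

```python
def last_checkpoint_path(task):
    """Return a local path to the Task's last registered output model, or None."""
    output_models = task.models["output"]
    if not output_models:
        return None
    # Assumption: output models are listed in the order they were registered.
    return output_models[-1].get_local_copy()


def load_state_into(model, task):
    """Load the last checkpoint into `model`; return True if one was found."""
    import torch  # imported lazily; only needed when a checkpoint exists

    path = last_checkpoint_path(task)
    if path is None:
        return False
    # Assumption: the file holds a state_dict saved via torch.save(model.state_dict(), ...).
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return True
```

In the training script this would typically be called right after Task.init(...), with training starting from scratch when it returns False.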
Last question: what is the HPO optimization algorithm? Is it just grid/random search, or is it BOHB/Optuna? If it is the latter, how do you make it "continue"?
I see! Then the command clearml-agent execute --id <task_id here> should reload the reported scalars and the task needs to reload last checkpoints only, right?
That's a good question too! We haven't figured out the best way of continuing for either grid search or Optuna. Can you suggest something?
should reload the reported scalars
Exactly. Notice that it also knows when the last scalars were reported, so it should automatically increase the iterations (i.e. you will not accidentally overwrite previously reported scalars).
and the task needs to reload last checkpoints only, right?
Correct 🙂
We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?
That is a good point. I'm not sure if we have a GH issue for that, but it is worth checking, and opening one if not. It should not be difficult to serialize/deserialize the internal step of the HPO process.
Once this is implemented, you could use the same "clearml-agent execute" to relaunch the HPO process as well.
wdyt?
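To make the serialize/deserialize idea concrete, here is a minimal, library-independent sketch of a grid search that persists its position to a JSON file and picks up where it stopped. It is not ClearML's HPO implementation; the class name, file layout and max_trials parameter are all assumptions made for the example.

```python
import itertools
import json
import os
import tempfile
from pathlib import Path


def grid_points(space):
    """Enumerate all parameter combinations in a deterministic order."""
    keys = sorted(space)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(space[k] for k in keys))]


class ResumableGridSearch:
    def __init__(self, space, state_path):
        self.points = grid_points(space)
        self.state_path = Path(state_path)
        self.next_index = 0
        self.results = []
        if self.state_path.exists():
            # Deserialize the saved step so already finished trials are skipped.
            state = json.loads(self.state_path.read_text())
            self.next_index = state["next_index"]
            self.results = state["results"]

    def _save(self):
        state = {"next_index": self.next_index, "results": self.results}
        self.state_path.write_text(json.dumps(state))

    def run(self, objective, max_trials=None):
        done = 0
        while self.next_index < len(self.points):
            if max_trials is not None and done >= max_trials:
                break  # simulate being paused part-way through
            params = self.points[self.next_index]
            self.results.append({"params": params, "score": objective(params)})
            self.next_index += 1
            self._save()  # persist after every trial
            done += 1
        return self.results


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "hpo_state.json")
    space = {"lr": [0.1, 0.01], "batch_size": [16, 32]}
    ResumableGridSearch(space, path).run(lambda p: p["lr"], max_trials=2)
    resumed = ResumableGridSearch(space, path)
    print(len(resumed.run(lambda p: p["lr"])))
```

Saving after every trial keeps the state file consistent even if the process is killed between trials. A sampler-based optimizer would also need its internal sampler state serialized, which this sketch does not attempt.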
Thanks for the answers AgitatedDove14 .
I will look in the GH issues and open one if there isn't a related one.