Hello, Everyone! I Have A Question Regarding Clearml Features. We Run Into The Situation When Some Of The Agents That Are Working On A Hpo Die Due To Variable Reasons. Some Workers Go Offline Or Resources Need Temporarily Be Detached For Other Needs. Thu

Answered

Hello, everyone!
I have a question regarding ClearML features.

We run into situations where some of the agents working on an HPO die for various reasons: some workers go offline, or resources need to be temporarily detached for other needs.
Thus, we are looking for a way to resurrect dead (or manually stopped) experiments, not from scratch but from the last available point, such that the main HPO task is able to aggregate the report summaries.

We have tried to manually restart tasks by reloading all the scalars from a dead task and loading the latest saved torch model.
However, this method generates a new task with a different task ID, and the main HPO task is unable to track those reports unless we restart the whole experiment all over again.

To summarize, we are looking for the following functionality:
1. Restart a dead experiment, reloading the last model states, and append its reports to the running HPO task.
2. A way to stop and restart model experiments at any given moment for proper resource utilization.

  				
Posted 2 years ago by ThickKitten19


Answers 12

We have tried to manually restart tasks by reloading all the scalars from a dead task and loading the latest saved torch model.

Hi ThickKitten19,
How did you try to restart them? How are you monitoring dying instances? Where and how are they running?

  				
Posted 2 years ago by AgitatedDove14

Hi AgitatedDove14, I get the reported scalars from the web using
    model_task = Task.get_task(task_id=model_task_id)
    scalars = model_task.get_reported_scalars()
then register each of the scalars with something like
    logger.report_scalar(title=metric_key, series=series_val['name'], value=y, iteration=x)
This gives me reported scalars to which I am able to append the rest of the model training reports.
Workers are running across multiple machines, and you can tell whether a task is dead by looking at the web page.
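
For readers following along, the replay step described above can be sketched roughly as follows. This is only an illustration: it assumes `get_reported_scalars()` returns a nested dict of the form `{title: {series: {"x": [...], "y": [...]}}}`, optionally with a `"name"` entry per series as in the snippet above, and the helper names `iter_scalar_points` and `replay_scalars` are hypothetical, not part of ClearML.

```python
from typing import Any, Callable, Dict, Iterator, Tuple


def iter_scalar_points(scalars: Dict[str, Dict[str, Dict[str, Any]]]) -> Iterator[Tuple[str, str, int, float]]:
    """Flatten the nested scalar dict into (title, series, iteration, value) tuples.

    Assumed layout: {title: {series_key: {"name": ..., "x": [...], "y": [...]}}}.
    """
    for title, series_map in scalars.items():
        for series_key, series_val in series_map.items():
            # Fall back to the dict key when no explicit "name" entry is present.
            name = series_val.get("name", series_key)
            for x, y in zip(series_val["x"], series_val["y"]):
                yield title, name, int(x), float(y)


def replay_scalars(scalars: Dict[str, Dict[str, Dict[str, Any]]], report: Callable[..., None]) -> int:
    """Call report(title=..., series=..., value=..., iteration=...) for every point.

    Returns the number of points that were re-reported.
    """
    count = 0
    for title, series, iteration, value in iter_scalar_points(scalars):
        report(title=title, series=series, value=value, iteration=iteration)
        count += 1
    return count


# Intended usage with ClearML (not executed here):
#   from clearml import Task
#   old = Task.get_task(task_id=model_task_id)
#   logger = Task.current_task().get_logger()
#   replay_scalars(old.get_reported_scalars(), logger.report_scalar)
```

Passing the reporting function in as a callable keeps the flattening logic testable without a ClearML server.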

  				
Posted 2 years ago by ThickKitten19

How did you try to restart them?

Yes, but how did you restart the agent on the remote machine?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 I am not restarting the agent itself; I just need to be able to continue the experiment from the same progress point. It can be a different agent. In fact, I am just loading the progress onto another agent within the available queue.

  				
Posted 2 years ago by ThickKitten19

It can be a different agent.

If inside a docker, then
    clearml-agent execute --id <task_id here> --docker
If you need venv, do
    clearml-agent execute --id <task_id here>
You can run that on any machine and it will respin and continue your Task
(obviously your code needs to be aware of that and be able to pull its own last model checkpoint from the Task artifacts / models).
Is this what you are after?
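
As a rough sketch of the "pull its own last model checkpoint" part: the helper below assumes the Task object exposes a `models` mapping with an `"output"` list ordered oldest to newest, and that each model offers `get_local_copy()`. The function name `latest_checkpoint_path` is hypothetical, and the behavior should be checked against your ClearML version.

```python
from typing import Any, Optional


def latest_checkpoint_path(task: Any) -> Optional[str]:
    """Return a local path to the most recent output model, or None on a first run.

    Assumption: task.models["output"] is a list ordered oldest -> newest and
    every entry provides get_local_copy().
    """
    output_models = task.models["output"]
    if not output_models:
        return None  # nothing was saved yet, so start from scratch
    return output_models[-1].get_local_copy()


# Intended usage inside the training script (not executed here):
#   from clearml import Task
#   import torch
#   task = Task.init(project_name="hpo", task_name="train")  # hypothetical names
#   ckpt = latest_checkpoint_path(task)
#   if ckpt is not None:
#       model.load_state_dict(torch.load(ckpt))
```

On a fresh run the helper returns `None`, so the same script can serve both the first launch and any later continuation.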

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 Let me clarify, I think you have misunderstood me.

The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
However, the importance of the experiment is low, so when other, more important experiments appear, we need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.
Later, when the more important experiments have been completed, we can continue the HPO task from the same state.
Hope this makes the problem clearer.

  				
Posted 2 years ago by ThickKitten19

The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.

Good point!

We need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.

Oh I see now....

Later, when the more important experiments have been completed, we can continue the HPO task from the same state.

Quick question: when you say the HPO Task, do you mean the HPO controller logic Task (i.e. the one launching the training jobs), or do you mean the actual training job itself (i.e. running with a specific set of parameters decided by the HPO controlling task)?

  				
Posted 2 years ago by AgitatedDove14

Quick question: when you say the HPO Task, do you mean the HPO controller logic Task (i.e. the one launching the training jobs), or do you mean the actual training job itself (i.e. running with a specific set of parameters decided by the HPO controlling task)?

AgitatedDove14 Sorry, my bad! By HPO task I mean the actual training job itself.
We run the HPO controller logic Task on a separate CPU-only machine, so we can assume that this task is always on. Only the training jobs can go offline (for the above-mentioned reasons).

  				
Posted 2 years ago by ThickKitten19

Okay, that makes sense. If this is the case, I would just use clearml-agent execute --id <task_id here> to continue the training Task.
Do notice you have to reload your last checkpoint from the Task's models/artifacts to continue 🙂
Last question: what is the HPO optimization algorithm? Is it just grid/random search or optuna hbop/optuna? If it is the latter, how do you make it "continue"?

  				
Posted 2 years ago by AgitatedDove14

I see! Then the command clearml-agent execute --id <task_id here> should reload the reported scalars and the task needs to reload the last checkpoints only, right?

That's a good question too! We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?

  				
Posted 2 years ago by ThickKitten19

should reload the reported scalars

Exactly (notice it also understands when the last report of scalars was, so it should automatically increase the iterations, i.e. you will not accidentally overwrite previously reported scalars).

and the task needs to reload last checkpoints only, right?

Correct 🙂
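
To make the "continue from the last checkpoint without overwriting earlier iterations" idea concrete, here is a small, self-contained toy loop. It is only a sketch: the JSON checkpoint file, the `train` function, and the `stop_after` switch (used to simulate a worker being taken away) are all hypothetical and stand in for a real model, a real checkpoint format, and real scalar reporting.

```python
import json
import os
from typing import Callable, Optional


def train(total_steps: int,
          ckpt_path: str,
          report: Callable[[int, float], None],
          stop_after: Optional[int] = None) -> int:
    """Run (or resume) a toy training loop and return the last completed iteration."""
    start, value = 0, 1.0
    if os.path.exists(ckpt_path):
        # Resume: restore the state and continue with the next iteration.
        with open(ckpt_path) as f:
            state = json.load(f)
        start, value = state["iteration"] + 1, state["value"]

    last = start - 1
    for it in range(start, total_steps):
        if stop_after is not None and it - start >= stop_after:
            break  # simulate the worker being stopped or reassigned
        value *= 0.5  # stand-in for one optimization step
        report(it, value)  # stand-in for reporting a scalar at this iteration
        with open(ckpt_path, "w") as f:
            json.dump({"iteration": it, "value": value}, f)
        last = it
    return last
```

A run that is interrupted after two steps and then restarted reports iterations 0 and 1 the first time and 2 onward the second time, so nothing is reported twice. If the agent already offsets iterations automatically, as described above, a script that also restores its own counter may need to account for that; check how your ClearML version handles the initial iteration offset.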

We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?

That is a good point. I am not sure if we have a GH issue for that, but it is worth checking and, if not, opening one. It should not be difficult to serialize/deserialize the internal step of the HPO process.
Once this is implemented, you could use the same "clearml-agent execute" to relaunch the HPO process as well.
wdyt?
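
As an illustration of what serializing and deserializing the optimizer's progress could look like, here is a minimal grid-search sketch. It is not ClearML's HPO implementation; the class `ResumableGridSearch` and its methods are hypothetical, and the only state it persists is the grid itself plus the index of the next untried combination.

```python
import itertools
import json
from typing import Any, Dict, List, Optional


class ResumableGridSearch:
    """Toy grid search whose progress can be saved and restored as JSON."""

    def __init__(self, grid: Dict[str, List[Any]], next_index: int = 0) -> None:
        self.grid = grid
        self.next_index = next_index
        keys = sorted(grid)  # fixed key order keeps the enumeration deterministic
        self._combos = [
            dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))
        ]

    def suggest(self) -> Optional[Dict[str, Any]]:
        """Return the next untried parameter set, or None when the grid is exhausted."""
        if self.next_index >= len(self._combos):
            return None
        combo = self._combos[self.next_index]
        self.next_index += 1
        return combo

    def to_json(self) -> str:
        """Serialize the grid and the position of the next combination."""
        return json.dumps({"grid": self.grid, "next_index": self.next_index})

    @classmethod
    def from_json(cls, payload: str) -> "ResumableGridSearch":
        """Rebuild a search that continues where the serialized one stopped."""
        state = json.loads(payload)
        return cls(state["grid"], state["next_index"])
```

Storing the JSON string somewhere the relaunched controller can read it would let the search skip combinations that were already handed out. A sampler with internal randomness or a learned model would need more state than a single index.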

  				
Posted 2 years ago by AgitatedDove14

Thanks for the answers, AgitatedDove14.
I will look through the GH issues and open one if there isn't a related one.

  				
Posted 2 years ago by ThickKitten19
