Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as SageMaker training jobs).

Answered

Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that the autoscaler doesn't support (such as SageMaker training jobs).

  				
Posted one year ago by HighRaccoon77


Answers 20

Hi @HighRaccoon77, the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant when running locally without Task.execute_remotely().

Are you specifically using SageMaker? Do you have any API you could work with to control the shutdown of machines?

  				
Posted one year ago by CostlyOstrich36
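For anyone who wants to try the 'shut down at the end of the script' idea, here is a minimal sketch. The `shutdown_if_remote` helper and the `sudo shutdown -h now` command are my own assumptions, not something from ClearML; `Task.running_locally()` is used to skip the shutdown on a developer machine. On SageMaker the instance is released when the entry script exits, so a shutdown command may not be needed (or even permitted) there.

```python
import subprocess
from typing import Callable, Sequence


def shutdown_if_remote(
    running_locally: bool,
    run: Callable[[Sequence[str]], object] = subprocess.run,
    command: Sequence[str] = ("sudo", "shutdown", "-h", "now"),  # assumed command
) -> bool:
    """Issue the shutdown command unless the script is running locally.

    Returns True if the command was issued, False if it was skipped.
    """
    if running_locally:
        return False
    run(list(command))
    return True


def main() -> None:
    # Imported here so the helper above can be tested without ClearML installed.
    from clearml import Task

    # Project and task names are placeholders.
    Task.init(project_name="examples", task_name="train then shut down")
    try:
        pass  # ... training code goes here ...
    finally:
        # Runs even if training raised, so the machine is not left idling.
        shutdown_if_remote(Task.running_locally())
```

Call `main()` at the bottom of your training script. Passing a different `run` callable makes it easy to dry-run the logic without actually powering anything off.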

The machine should shut down automatically once clearml-agent exits.

  				
Posted one year ago by HighRaccoon77

I guess you could probably introduce some code into the clearml-agent, either as a configuration option in clearml.conf or as a CLI flag, that would send a shutdown command to the machine once the agent finishes running a job.

  				
Posted one year ago by CostlyOstrich36

Sure. Let me take a look at it.

  				
Posted one year ago by HighRaccoon77

Thanks!

  				
Posted one year ago by HighRaccoon77

Maybe even make a PR out of it if you want 🙂

How are you launching the agents?

  				
Posted one year ago by CostlyOstrich36

SageMaker runs a train.sh script that I provide. I just put the clearml-agent commands in that script.

  				
Posted one year ago by HighRaccoon77
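Since the instance goes away once the launcher script exits, one way to get the idle timeout from the original question is to wrap the agent in a small watchdog. This is only a sketch: `IdleTracker`, `run_agent_until_idle`, and the `is_busy` callable are hypothetical names of mine, and how you detect "busy" (for example, by asking the ClearML server whether this worker currently has a task) is left to you. The agent command uses the existing `daemon --queue ... --foreground` options; the queue name is a placeholder.

```python
import subprocess
import time
from typing import Callable, Sequence


class IdleTracker:
    """Tracks how long a worker has been idle from periodic busy/idle samples."""

    def __init__(self, idle_limit_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.idle_limit_sec = idle_limit_sec
        self._clock = clock
        self._last_busy = clock()  # treat startup as the last "busy" moment

    def update(self, busy: bool) -> bool:
        """Record one sample; return True once the idle limit is reached."""
        now = self._clock()
        if busy:
            self._last_busy = now
        return (now - self._last_busy) >= self.idle_limit_sec


def run_agent_until_idle(agent_cmd: Sequence[str],
                         is_busy: Callable[[], bool],
                         idle_limit_sec: float = 1800.0,
                         poll_sec: float = 30.0) -> int:
    """Start the agent and terminate it after `idle_limit_sec` with no task."""
    agent = subprocess.Popen(list(agent_cmd))
    tracker = IdleTracker(idle_limit_sec)
    try:
        while agent.poll() is None:          # agent process still alive
            if tracker.update(is_busy()):    # idle for too long: stop polling
                break
            time.sleep(poll_sec)
    finally:
        if agent.poll() is None:
            agent.terminate()                # ask the agent to exit
            agent.wait()
    return agent.returncode


# Example call (not executed here); `my_is_busy` is something you would write:
# run_agent_until_idle(
#     ["clearml-agent", "daemon", "--queue", "default", "--foreground"],
#     is_busy=my_is_busy,
# )
```

Calling this from train.sh (for example via `python watchdog.py`) means the script returns once the agent has been idle long enough, and the provider then reclaims the machine.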

Any specific reason not to use the autoscaler? I would imagine it would be even more cost-effective.

  				
Posted one year ago by CostlyOstrich36

And easier to manage without the need for such 'hacks' 😛

  				
Posted one year ago by CostlyOstrich36

It's more difficult to get p4de quota / capacity from EC2 than from SageMaker.

  				
Posted one year ago by HighRaccoon77

EC2 is indeed cheaper than SageMaker though, and it's supported by the autoscaler.

  				
Posted one year ago by HighRaccoon77

With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances that are quickly available on demand, I can understand the issue.

  				
Posted one year ago by CostlyOstrich36

It would be best if the autoscaler could support SageMaker and a few other providers that have better on-demand GPU supply.

  				
Posted one year ago by HighRaccoon77

And you use the agent to set up the environment for the experiment to run?

  				
Posted one year ago by CostlyOstrich36

BTW, considering the lower cost of EC2, you could always use longer idle timeouts for the autoscaler to ensure better availability of machines.

  				
Posted one year ago by CostlyOstrich36

That is, keeping machines up for longer at a fairly low cost (especially if you're using spot instances).

  				
Posted one year ago by CostlyOstrich36

Glad to see it works (thanks for sharing, @HighRaccoon77).
I have a question on dynamic GPU allocation, disregarding any autoscaling considerations:
Let's say we spin up a ClearML agent on an 8-GPU instance (via a launcher script, as @HighRaccoon77 is doing) with --dynamic-gpus enabled, serving a 2-GPU queue and a 4-GPU queue. The agent pulls a new task that requires only 2 GPUs, and while that task is running, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to complete the first 2-GPU task before launching the 4-GPU task, or can they run concurrently? @CostlyOstrich36

  				
Posted one year ago by CurvedOwl19

If not, would the right workaround be to launch, say, 3 different agents from the same launcher script: 2 of them with access to 2 GPUs each (agent1 on GPUs 0,1 and agent2 on GPUs 2,3) and the third with access to 4 GPUs (agent3 on GPUs 4,5,6,7)? This assumes I want to have more 2-GPU jobs running than 4-GPU jobs.

  				
Posted one year ago by CurvedOwl19
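The three-agent layout described above could be sketched roughly like this. It only builds the command lines, one static agent per GPU group, using the `--queue`, `--gpus`, and `--detached` options of `clearml-agent daemon`. The `agent_commands` helper and the queue names `2gpu` and `4gpu` are made up for illustration, and this says nothing about how `--dynamic-gpus` itself schedules tasks.

```python
from typing import Dict, List, Sequence


def agent_commands(gpu_groups: Dict[str, Sequence[Sequence[int]]]) -> List[List[str]]:
    """Build one `clearml-agent daemon` command per GPU group.

    `gpu_groups` maps a queue name to the GPU index groups that serve it.
    Raises ValueError if a GPU index is assigned to more than one agent.
    """
    commands: List[List[str]] = []
    seen = set()
    for queue, groups in gpu_groups.items():
        for gpus in groups:
            overlap = seen.intersection(gpus)
            if overlap:
                raise ValueError(f"GPUs assigned twice: {sorted(overlap)}")
            seen.update(gpus)
            commands.append([
                "clearml-agent", "daemon",
                "--queue", queue,
                "--gpus", ",".join(str(g) for g in gpus),
                "--detached",
            ])
    return commands


# Hypothetical layout: two 2-GPU agents and one 4-GPU agent on an 8-GPU box.
layout = {
    "2gpu": [(0, 1), (2, 3)],
    "4gpu": [(4, 5, 6, 7)],
}
commands = agent_commands(layout)
for cmd in commands:
    print(" ".join(cmd))
```

In a launcher script each command could then be passed to `subprocess.run`. The trade-off of a fixed split like this is that the 4-GPU group sits idle whenever only 2-GPU jobs are queued.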

@HighRaccoon77 were you able to make the instance stop after a job launched by the agent was complete?

  				
Posted one year ago by CurvedOwl19

Finally, is there any way of limiting the host memory that each task can use?

  				
Posted one year ago by CurvedOwl19
