Hi Folks, Is It Possible To Use An Aws P3 Instance (Which As Several Gpus) With One Agent Per Gpu, All Controlled Through Clearml Aws Autoscheduler? So Clearml Aws Autoscheduler Would Know In Advance How Much Agents To Start In The Instances (Can Be An Op

Answered

Hi folks, Is it possible to use an aws p3 instance (which as several GPUs) with one agent per GPU, all controlled through ClearML AWS AutoScheduler?
So ClearML aws AutoScheduler would know in advance how much agents to start in the instances (can be an option: "one per GPU available in the machine") and then only kill the machine if all GPUs are idle for more than x mins.

  				
Posted 
	4 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Votes Newest

Answers 5

JitteryCoyote63 Hmmm in theory, yes.
In practice you need to change this line:
https://github.com/allegroai/clearml/blob/fbbae0b8bc933fbbb9811faeabb9b6d9a0ea8d97/clearml/automation/aws_auto_scaler.py#L78
` python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 0 --detached

python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 1 --detached

python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 2 --detached

...

python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 7 Notice the last line should not have --docker `
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)

  				
Posted 
	4 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

if I encounter the need for that, I will adapt and open a PR

Great!

  				
Posted 
	4 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

WDYT?

  				
Posted 
	4 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Notice the last line should not have

--docker

Did you meant --detached ?

I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)

That's what I though yea, no problem, it was rather a question, if I encounter the need for that, I will adapt and open a PR 🙂

  				
Posted 
	4 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Did you meant

--detached

?

Oops yes sorry you are correct should be --detached 🙂

  				
Posted 
	4 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

1K Views

5 Answers

4 years ago

2 years ago