Answered
Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as SageMaker training jobs).


  
  
Posted 22 days ago

Answers 20


Maybe even make a PR out of it if you want 🙂

How are you launching the agents?

  
  
Posted 22 days ago

SageMaker runs a train.sh script that I provide. I just put the clearml-agent commands in that script.

  
  
Posted 22 days ago

Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on dynamic GPU allocation, disregarding any autoscaling considerations:
Let’s say we spin up a ClearML agent on an 8-GPU instance (via a launcher script, as @<1632913939241111552:profile|HighRaccoon77> is doing) with --dynamic-gpus enabled, serving a 2-GPU queue and a 4-GPU queue. The agent pulls a new task that requires only 2 GPUs, and while that task is running, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to finish the 2-GPU task before launching the 4-GPU task, or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 6 days ago

@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?

  
  
Posted 6 days ago

If not, would the right workaround be to launch, say, 3 different agents from the same launcher script: 2 of them with access to 2 GPUs each (agent1 on GPUs 0,1; agent2 on GPUs 2,3) and the third with access to 4 GPUs (agent3 on GPUs 4,5,6,7)? This assumes I want more 2-GPU jobs running than 4-GPU jobs.
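
The three-agent layout proposed above could be sketched roughly as follows. This only builds the command lines; it is not a confirmed answer to the concurrency question. The queue names `2gpu_queue` and `4gpu_queue` are made up for illustration, and the sketch assumes the `clearml-agent daemon` options `--queue`, `--gpus`, and `--detached` behave as documented.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (queue name, GPU indices) per agent on one 8-GPU machine.
AGENTS: List[Tuple[str, Sequence[int]]] = [
    ("2gpu_queue", [0, 1]),
    ("2gpu_queue", [2, 3]),
    ("4gpu_queue", [4, 5, 6, 7]),
]


def agent_command(queue: str, gpus: Sequence[int]) -> List[str]:
    """Build one `clearml-agent daemon` command pinned to the given GPUs."""
    gpu_arg = ",".join(str(g) for g in gpus)
    return ["clearml-agent", "daemon", "--detached",
            "--queue", queue, "--gpus", gpu_arg]


def build_commands(agents: Sequence[Tuple[str, Sequence[int]]] = AGENTS) -> List[List[str]]:
    """Build all agent commands, refusing layouts where two agents share a GPU."""
    used = set()
    commands = []
    for queue, gpus in agents:
        overlap = used.intersection(gpus)
        if overlap:
            raise ValueError(f"GPUs assigned twice: {sorted(overlap)}")
        used.update(gpus)
        commands.append(agent_command(queue, gpus))
    return commands


if __name__ == "__main__":
    # Print the commands so they can be pasted into the launcher script.
    for cmd in build_commands():
        print(" ".join(cmd))
```

The printed lines could go into the same launcher script that already starts the agent. The trade-off of a static split like this is that a 4-GPU job can never borrow idle GPUs from the 2-GPU agents.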

  
  
Posted 6 days ago

With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4 equivalent instances and on fast demand I can understand the issue

  
  
Posted 22 days ago

Hi @<1632913939241111552:profile|HighRaccoon77> , the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously it would be unpleasant to run locally without Task.execute_remotely()

Are you specifically using Sagemaker? Do you have any api interface you could work with to manipulate shutdown of machines?
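
A minimal sketch of the "piece of code at the end of your script" idea, with a guard so it does nothing when the script is run locally. The `ALLOW_SHUTDOWN` environment variable is a hypothetical opt-in invented for this example (it is not a ClearML setting), and the shutdown commands assume the process has the privileges to power off the machine.

```python
import os
import subprocess
import sys
from typing import List, Mapping, Optional


def shutdown_command(platform: str = sys.platform) -> List[str]:
    """Return the OS-level command that powers off the machine."""
    if platform.startswith("win"):
        return ["shutdown", "/s", "/t", "0"]
    return ["sudo", "shutdown", "-h", "now"]


def maybe_shutdown(env: Optional[Mapping[str, str]] = None,
                   dry_run: bool = True) -> bool:
    """Shut the machine down only if ALLOW_SHUTDOWN=1 (hypothetical opt-in).

    Returns True if a shutdown was requested (or would be, in dry-run mode).
    """
    env = os.environ if env is None else env
    if env.get("ALLOW_SHUTDOWN") != "1":
        return False  # e.g. a local run: leave the machine alone
    cmd = shutdown_command()
    if dry_run:
        print("Would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # ... training code would run here ...
    maybe_shutdown(dry_run=True)  # switch to dry_run=False on the remote machine
```

Setting the variable only in the remote environment keeps local runs safe, which addresses the "unpleasant to run locally" concern.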

  
  
Posted 22 days ago

And you use the agent to set up the environment for the experiment to run?

  
  
Posted 22 days ago

finally, is there any way of limiting the host memory that each task can use?

  
  
Posted 6 days ago

I guess you could probably introduce some code into the clearml agent as a configuration in clearml.conf or even as a flag in the CLI that would send a shutdown command to the machine once the agent finishes running a job
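
Until something like that exists in the agent itself, the idle-timeout logic could live in the launcher script. Below is a rough sketch of just the timing part. `agent_is_busy` is a placeholder callable that you would have to implement yourself (for example by asking the ClearML server whether the worker currently has a task); it is not an existing ClearML function.

```python
import time
from typing import Callable, Optional


class IdleWatchdog:
    """Tracks how long the agent has been continuously idle."""

    def __init__(self, idle_timeout_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_timeout_sec = idle_timeout_sec
        self._clock = clock
        self._idle_since: Optional[float] = None

    def update(self, is_busy: bool) -> bool:
        """Record one observation; return True once idle for the full timeout."""
        now = self._clock()
        if is_busy:
            self._idle_since = None  # any activity resets the idle timer
            return False
        if self._idle_since is None:
            self._idle_since = now
        return now - self._idle_since >= self.idle_timeout_sec


def run_until_idle(agent_is_busy: Callable[[], bool],
                   idle_timeout_sec: float,
                   poll_sec: float = 30.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> None:
    """Block until the agent has been idle for idle_timeout_sec seconds."""
    watchdog = IdleWatchdog(idle_timeout_sec, clock)
    while not watchdog.update(agent_is_busy()):
        sleep(poll_sec)
```

Once `run_until_idle` returns, the launcher would stop the agent and exit. The `clock` and `sleep` parameters exist only so the logic can be tested without waiting in real time.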

  
  
Posted 22 days ago

BTW, considering the lower costs of EC2, you could always use longer timeout times for the autoscaler to ensure better availability of machines

  
  
Posted 22 days ago

And easier to manage without the need for such 'hacks' 😛

  
  
Posted 22 days ago

It's more difficult to get p4de quota / capacity from EC2 than Sagemaker.

  
  
Posted 22 days ago

That way you keep machines up for longer at a fairly low cost (especially if you're using spot instances)

  
  
Posted 22 days ago

Any specific reason not to use the autoscaler? I would imagine it would be even more cost effective

  
  
Posted 22 days ago

It would be best if the autoscaler could support SageMaker and a few other providers that have better on-demand GPU supply.

  
  
Posted 22 days ago

The machine should shut down automatically once clearml-agent exits.

  
  
Posted 22 days ago

Thanks!

  
  
Posted 22 days ago

Sure. Let me take a look at it.

  
  
Posted 22 days ago

EC2 is indeed cheaper than Sagemaker tho, and it's supported by autoscaler.

  
  
Posted 22 days ago