Answered
Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that the autoscaler doesn't support (such as SageMaker training jobs).

  
  
Posted 8 months ago

Answers 20


Hi @<1632913939241111552:profile|HighRaccoon77> , the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant to run locally without Task.execute_remotely() .

Are you specifically using Sagemaker? Do you have any api interface you could work with to manipulate shutdown of machines?
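A minimal sketch of that 'basic' solution, assuming a Linux host with passwordless sudo for shutdown; the queue name, project name, and the `shutdown_command` helper are all assumptions for illustration:

```python
import subprocess


def shutdown_command(delay_minutes: int = 0) -> list[str]:
    """Build a Linux shutdown command: immediate ('now') or delayed by N minutes."""
    when = "now" if delay_minutes == 0 else f"+{delay_minutes}"
    return ["sudo", "shutdown", "-h", when]


def main() -> None:
    # Third-party dependency: pip install clearml
    from clearml import Task

    task = Task.init(project_name="examples", task_name="train-then-shutdown")
    # When run locally this enqueues the task and exits the process;
    # everything below executes only on the machine the agent assigned
    # (the "default" queue name is an assumption).
    task.execute_remotely(queue_name="default")

    # ... training code ...

    # Power off the host once the work is done.
    subprocess.run(shutdown_command())
```

Keeping the shutdown behind `execute_remotely()` is what makes this safe to run from a laptop: the local process stops at the enqueue step and never reaches the shutdown call.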

  
  
Posted 8 months ago

I guess you could probably introduce some code into the clearml-agent, as a configuration option in clearml.conf or even a CLI flag, that would send a shutdown command to the machine once the agent finishes running a job

  
  
Posted 8 months ago

Thanks!

  
  
Posted 8 months ago

Sure. Let me take a look at it.

  
  
Posted 8 months ago

SageMaker runs a train.sh script I provide. I just put clearml-agent commands in that script.
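A possible shape for such a train.sh, under the assumption that the task ID is passed into the job (e.g. via an environment variable); `clearml-agent execute` runs exactly one task and exits, which ends the SageMaker job and releases the instance:

```shell
#!/bin/bash
# Hypothetical train.sh handed to SageMaker.
# TASK_ID is assumed to be provided to the job by the caller.
set -euo pipefail

clearml-agent execute --id "$TASK_ID"
```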

  
  
Posted 8 months ago

Any specific reason not to use the autoscaler? I would imagine it would be even more cost-effective

  
  
Posted 8 months ago

And easier to manage without the need for such 'hacks' 😛

  
  
Posted 8 months ago

EC2 is indeed cheaper than SageMaker, though, and it's supported by the autoscaler.

  
  
Posted 8 months ago

With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances and need them on demand quickly, I can understand the issue

  
  
Posted 8 months ago

It would be best if the autoscaler could support SageMaker and a few other providers that have better on-demand GPU supply.

  
  
Posted 8 months ago

And you use the agent to set up the environment for the experiment to run?

  
  
Posted 8 months ago

Keeping machines up for a longer time at a considerably lower cost (especially if you're using spot instances)

  
  
Posted 8 months ago

Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on dynamic GPU allocation, disregarding any autoscaling considerations:
Let's say we spin up a ClearML agent on an 8-GPU instance (via a launcher script, as @<1632913939241111552:profile|HighRaccoon77> is doing), with --dynamic-gpus enabled, catering to a 2-GPU queue and a 4-GPU queue. The agent pulls in a new task that only requires 2 GPUs, and while that task is ongoing, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to complete the first 2-GPU task before launching the 4-GPU task? Or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>
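For reference, the scenario above would be launched roughly like this (the queue names are assumptions; `--dynamic-gpus` lets one agent allocate GPUs per queue):

```shell
# Hypothetical launch: one agent managing GPUs 0-7, allocating
# 2 GPUs per task from dual_gpu and 4 per task from quad_gpu.
clearml-agent daemon --dynamic-gpus --gpus 0-7 \
    --queue dual_gpu=2 quad_gpu=4
```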

  
  
Posted 7 months ago

If not, would the right workaround be to launch, say, 3 different agents from the same launcher script: 2 of them with access to 2 GPUs each (agent1 - gpus 0,1; agent2 - gpus 2,3), and the other with access to 4 GPUs (agent3 - gpus 4,5,6,7)? Assuming I want more 2-GPU jobs running than 4-GPU jobs.
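The static split described above could be sketched as three separate daemons, each pinned to its own GPUs (queue names are assumptions; `--detached` runs each agent in the background):

```shell
# Hypothetical launcher fragment: two 2-GPU agents and one 4-GPU agent.
clearml-agent daemon --detached --queue dual_gpu --gpus 0,1
clearml-agent daemon --detached --queue dual_gpu --gpus 2,3
clearml-agent daemon --detached --queue quad_gpu --gpus 4,5,6,7
```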

  
  
Posted 7 months ago

@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?

  
  
Posted 7 months ago

BTW, considering the lower cost of EC2, you could always use longer timeouts for the autoscaler to ensure better availability of machines

  
  
Posted 8 months ago

Maybe even make a PR out of it if you want 🙂

How are you launching the agents?

  
  
Posted 8 months ago

It's more difficult to get p4de quota / capacity from EC2 than from SageMaker.

  
  
Posted 8 months ago

The machine should shut down automatically once clearml-agent exits.

  
  
Posted 8 months ago

Finally, is there any way of limiting the host memory that each task can use?
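One possible approach, assuming the limit can be applied from inside the task script itself (Linux only; this is a standard-library mechanism, not a ClearML agent feature):

```python
import resource


def limit_memory(max_bytes: int) -> None:
    """Cap this process's virtual address space via RLIMIT_AS.

    A task script could call this at startup to bound its own host
    memory; child processes inherit the limit.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
```

For example, `limit_memory(8 * 1024**3)` would make allocations beyond 8 GiB fail with `MemoryError` rather than exhausting the host. For enforcement outside the process, cgroups (e.g. via `systemd-run --property=MemoryMax=...`) would be the usual alternative.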

  
  
Posted 7 months ago
481 Views