Answered
Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as SageMaker training jobs).

Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as SageMaker training jobs).

  
  
Posted 11 months ago

Answers 20


It would be best if the autoscaler could support SageMaker and a few other providers that have better on-demand GPU supply.

  
  
Posted 11 months ago

With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances available on demand quickly, I can understand the issue.

  
  
Posted 11 months ago

If not, would the right workaround be to launch, let's say, 3 different agents from the same launcher script, 2 of them with access to 2 GPUs each (agent1 - gpus 0,1; agent2 - gpus 2,3), and the other with access to 4 GPUs (agent3 - gpus 4,5,6,7)? Assuming I want to have more 2-GPU jobs running than 4-GPU jobs.
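As a sketch, that static split might look like the following in a launcher script. The queue names here are invented placeholders; `--queue`, `--gpus`, and `--detached` are standard `clearml-agent daemon` flags:

```shell
#!/usr/bin/env bash
# Hypothetical launcher: three agents sharing one 8-GPU machine.
# Queue names (dual_gpu, quad_gpu) are placeholders, not from the thread.
AGENT1_GPUS="0,1"     # first 2-GPU agent
AGENT2_GPUS="2,3"     # second 2-GPU agent
AGENT3_GPUS="4,5,6,7" # 4-GPU agent

# Only attempt the launch if clearml-agent is actually installed.
if command -v clearml-agent >/dev/null 2>&1; then
    clearml-agent daemon --queue dual_gpu --gpus "$AGENT1_GPUS" --detached
    clearml-agent daemon --queue dual_gpu --gpus "$AGENT2_GPUS" --detached
    clearml-agent daemon --queue quad_gpu --gpus "$AGENT3_GPUS" --detached
fi
```

With this layout the two `dual_gpu` agents can each run a 2-GPU job concurrently, while the `quad_gpu` agent serves 4-GPU jobs independently.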

  
  
Posted 11 months ago

SageMaker runs a train.sh script I provide. I just put clearml-agent commands in that script.
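A minimal sketch of what such a train.sh could look like, assuming the job is meant to run one specific enqueued task: `clearml-agent execute --id` exits when the task finishes (unlike `daemon`, which keeps polling the queue), so a shutdown step placed after it actually runs. The environment-variable name below is a made-up placeholder, not part of ClearML or SageMaker:

```shell
#!/usr/bin/env bash
# Hypothetical SageMaker entry point; not the author's actual script.
set -eu

# Placeholder: id of the enqueued ClearML task this job should run.
TASK_ID="${CLEARML_EXEC_TASK_ID:-}"

# Build the command up front so it is easy to inspect.
EXEC_CMD="clearml-agent execute --id"

if command -v clearml-agent >/dev/null 2>&1 && [ -n "$TASK_ID" ]; then
    $EXEC_CMD "$TASK_ID"
    # Agent process exited -> power the instance off (needs passwordless sudo).
    sudo shutdown -h now
fi
```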

  
  
Posted 11 months ago

The machine should shut down automatically once clearml-agent exits.

  
  
Posted 11 months ago

Hi @<1632913939241111552:profile|HighRaccoon77> , the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant to run locally without Task.execute_remotely()
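One way to keep such a shutdown snippet in the script without risking powering off a developer's local machine is to gate it on something only present in an agent-run environment. A sketch in shell; the assumption (worth verifying) is that clearml-agent exports CLEARML_TASK_ID into the job's environment:

```shell
# Returns success only when running under a clearml-agent, never locally.
# Assumption: the agent exports CLEARML_TASK_ID for the running job.
should_shutdown() {
    [ -n "${CLEARML_TASK_ID:-}" ]
}

# ... training finishes above ...

if should_shutdown; then
    sudo shutdown -h now   # needs passwordless sudo for shutdown
fi
```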

Are you specifically using SageMaker? Do you have any API you could work with to manage shutdown of the machines?

  
  
Posted 11 months ago

BTW, considering the lower costs of EC2, you could always use longer timeouts for the autoscaler to ensure better availability of machines.

  
  
Posted 11 months ago

Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on dynamic GPU allocation, disregarding any autoscaling considerations:
Let's say we spin up a ClearML agent on an 8-GPU instance (via a launcher script, as @<1632913939241111552:profile|HighRaccoon77> is doing), with --dynamic-gpus enabled, catering to a 2-GPU queue and a 4-GPU queue. The agent pulls in a new task that only requires 2 GPUs, and while that task is ongoing, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to complete the first 2-GPU task before launching the 4-GPU task? Or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>
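For reference, the setup being described would be launched with something like the command below. The queue names are invented, and the `--dynamic-gpus` syntax follows the clearml-agent documentation; note that dynamic GPU allocation may require the enterprise agent, so treat this as a sketch rather than a verified recipe:

```shell
# One agent managing all 8 GPUs, carving them up per pulled task:
# tasks from dual_gpu get 2 GPUs, tasks from quad_gpu get 4.
DYNAMIC_ARGS='--dynamic-gpus --gpus 0-7 --queue dual_gpu=2 quad_gpu=4'

if command -v clearml-agent >/dev/null 2>&1; then
    # shellcheck disable=SC2086  # intentional word splitting of the args
    clearml-agent daemon $DYNAMIC_ARGS
fi
```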

  
  
Posted 11 months ago

EC2 is indeed cheaper than SageMaker, though, and it's supported by the autoscaler.

  
  
Posted 11 months ago

@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?

  
  
Posted 11 months ago

Finally, is there any way of limiting the host memory that each task can use?

  
  
Posted 11 months ago

Keeping machines up for a longer time at a fairly lower cost (especially if you're using spot instances).

  
  
Posted 11 months ago

And you use the agent to set up the environment for the experiment to run?

  
  
Posted 11 months ago

I guess you could probably introduce some code into the clearml-agent, as a configuration option in clearml.conf or even a flag in the CLI, that would send a shutdown command to the machine once the agent finishes running a job.

  
  
Posted 11 months ago

Any specific reason not to use the autoscaler? I would imagine it would be even more cost-effective.

  
  
Posted 11 months ago

And easier to manage without the need for such 'hacks' 😛

  
  
Posted 11 months ago

Sure. Let me take a look at it.

  
  
Posted 11 months ago

Maybe even make a PR out of it if you want 🙂

How are you launching the agents?

  
  
Posted 11 months ago

Thanks!

  
  
Posted 11 months ago

It's more difficult to get p4de quota/capacity from EC2 than from SageMaker.

  
  
Posted 11 months ago
754 Views