Answered
Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as Sagemaker training jobs).

Is there a way to configure a clearml-agent so that it shuts down the server after it has been idle for a certain time period? We are using GPU resources from a provider that autoscaling doesn't support (such as Sagemaker training jobs).

  
  
Posted one year ago

Answers 20


With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances and need them on demand quickly, I can understand the issue.

  
  
Posted one year ago

It's more difficult to get p4de quota/capacity on EC2 than on Sagemaker.

  
  
Posted one year ago

Hi @<1632913939241111552:profile|HighRaccoon77>, the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant to run locally without Task.execute_remotely().

Are you specifically using Sagemaker? Do you have any API you could work with to trigger a shutdown of the machines?
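For illustration, the "shutdown code at the end of the script" approach could look roughly like this. This is a minimal sketch only, assuming a Linux machine where the agent's user can run shutdown via passwordless sudo; the project and queue names are made up:

```python
import subprocess

from clearml import Task

task = Task.init(project_name="examples", task_name="train-then-shutdown")
# Re-enqueue this script to run on a remote agent; the local process exits here.
task.execute_remotely(queue_name="default")

# ... training code runs on the remote machine ...

# Final step on the remote machine: power it off once training is done.
# Assumes passwordless sudo rights for shutdown; adjust for your environment.
subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
```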

  
  
Posted one year ago

BTW, considering the lower cost of EC2, you could always use longer idle timeouts for the autoscaler to ensure better availability of machines.

  
  
Posted one year ago

It would be best if the autoscaler could support Sagemaker and a few other providers that have better on-demand GPU supply.

  
  
Posted one year ago

Sagemaker runs a train.sh script I provide. I just put clearml-agent commands in that script.
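A rough sketch of what that launcher boils down to (written in Python rather than shell for illustration; the environment variable and the choice of clearml-agent execute are assumptions, not details from this thread): run one task through the agent and exit when it finishes, which ends the Sagemaker training job and releases the instance.

```python
import os
import subprocess

# Hypothetical stand-in for the train.sh launcher: execute a single task via
# the agent, then exit. How the task ID reaches the launcher (here an
# environment variable) is an assumption for illustration only.
task_id = os.environ["TRAIN_TASK_ID"]
subprocess.run(["clearml-agent", "execute", "--id", task_id], check=True)
```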

  
  
Posted one year ago

Sure. Let me take a look at it.

  
  
Posted one year ago

Any specific reason not to use the autoscaler? I would imagine it would be even more cost-effective.

  
  
Posted one year ago

And easier to manage without the need for such 'hacks' 😛

  
  
Posted one year ago

EC2 is indeed cheaper than Sagemaker though, and it's supported by the autoscaler.

  
  
Posted one year ago

Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on Dynamic GPU allocation, disregarding any autoscaling considerations:
Let’s say we spin up a ClearML agent on an 8-GPU instance (via a launcher script as @<1632913939241111552:profile|HighRaccoon77> is doing), with --dynamic-gpus enabled, catering to a 2-GPU queue and a 4-GPU queue. The agent pulls in a new task that only requires 2 GPUs, and while that task is ongoing, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to complete the first 2-GPU task before launching the 4-GPU task? Or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>
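For reference, the setup described would be launched with something along these lines. This is a sketch only: the queue names are made up, and the exact --dynamic-gpus syntax (and its availability on your ClearML tier) should be checked against the clearml-agent docs:

```python
import subprocess

# Sketch: one agent owns GPUs 0-7 and allocates them dynamically per queue.
# Queue names are assumptions; verify the --dynamic-gpus syntax in the docs.
subprocess.run(
    [
        "clearml-agent", "daemon",
        "--dynamic-gpus",
        "--gpus", "0-7",
        "--queue", "dual_gpu=2", "quad_gpu=4",
    ],
    check=True,
)
```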

  
  
Posted one year ago

I guess you could introduce some code into the clearml-agent, as a configuration option in clearml.conf or even as a CLI flag, that would send a shutdown command to the machine once the agent finishes running a job.

  
  
Posted one year ago

The machine should shut down automatically once clearml-agent exits.

  
  
Posted one year ago

Thanks!

  
  
Posted one year ago

And you use the agent to set up the environment for the experiment to run?

  
  
Posted one year ago

If not, would the right workaround be to launch, let’s say, 3 different agents from the same launcher script, two of them with access to 2 GPUs each (agent1 - gpus 0,1; agent2 - gpus 2,3), and the other with access to 4 GPUs (agent3 - gpus 4,5,6,7)? Assuming I want to have more 2-GPU jobs running than 4-GPU jobs.
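A sketch of that static split, for illustration only (queue names are made up; --detached keeps each agent running in the background):

```python
import subprocess

# Three agents on one machine, each pinned to its own GPUs and queue.
# Queue names are assumptions for illustration.
agents = [
    ("0,1", "dual_gpu"),      # agent1
    ("2,3", "dual_gpu"),      # agent2
    ("4,5,6,7", "quad_gpu"),  # agent3
]
for gpus, queue in agents:
    subprocess.run(
        ["clearml-agent", "daemon", "--detached", "--gpus", gpus, "--queue", queue],
        check=True,
    )
```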

  
  
Posted one year ago

Finally, is there any way of limiting the host memory that each task can use?

  
  
Posted one year ago

Maybe even make a PR out of it if you want 🙂

How are you launching the agents?

  
  
Posted one year ago

@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?

  
  
Posted one year ago

Keeping machines up for a longer time comes at a fairly low cost (especially if you're using spot instances).

  
  
Posted one year ago