Keeping machines up for longer at a fairly low cost (especially if you're using spot instances)
Maybe even make a PR out of it if you want 🙂
How are you launching the agents?
I guess you could probably introduce some code into the clearml agent as a configuration in clearml.conf
or even as a flag in the CLI that would send a shutdown command to the machine once the agent finishes running a job
And easier to manage without the need for such 'hacks' 😛
SageMaker runs a train.sh script I provide. I just put clearml-agent commands in that script.
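For context, a minimal sketch of what that train.sh looks like (the queue name and the env-var credentials are assumptions on my side):
```bash
#!/bin/bash
# Minimal sketch of a SageMaker train.sh entry point that just runs a ClearML agent.
# Credentials/hosts are assumed to be injected as environment variables
# (CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY).
pip install clearml-agent

# Run the agent in the foreground so the SageMaker job stays alive
# for as long as the agent is serving the queue.
clearml-agent daemon --queue default --foreground
```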
The machine should shut down automatically once clearml-agent exits.
If not, would the right workaround be to launch, let's say, 3 different agents from the same launcher script, two of them with access to 2 GPUs each (agent1 - gpus 0,1; agent2 - gpus 2,3) and the other with access to 4 GPUs (agent3 - gpus 4,5,6,7)? Assuming I want to have more 2-GPU jobs running than 4-GPU jobs.
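A rough sketch of the launcher I have in mind (queue names are just placeholders):
```bash
#!/bin/bash
# Hypothetical launcher: three agents sharing one 8-GPU machine,
# each pinned to a fixed set of GPUs and pulling from its own queue.
clearml-agent daemon --detached --gpus 0,1     --queue 2gpu_queue   # agent1
clearml-agent daemon --detached --gpus 2,3     --queue 2gpu_queue   # agent2
clearml-agent daemon --detached --gpus 4,5,6,7 --queue 4gpu_queue   # agent3
```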
Any specific reason not to use the autoscaler? I would imagine it would be even more cost effective
It's more difficult to get p4de quota/capacity on EC2 than on SageMaker.
BTW, considering the lower costs of EC2, you could always use longer timeout times for the autoscaler to ensure better availability of machines
@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?
Hi @<1632913939241111552:profile|HighRaccoon77>, the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine, but obviously that would be unpleasant to run locally without Task.execute_remotely()
Are you specifically using SageMaker? Do you have any API you could use to trigger a shutdown of the machines?
With the autoscaler it's also easier to configure a large variety of compute resources. Although if you're only interested in p4-equivalent instances and need them available quickly, I can understand the issue
It would be best if the autoscaler could support SageMaker and a few other providers that have better on-demand GPU supply.
EC2 is indeed cheaper than SageMaker though, and it's supported by the autoscaler.
Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on Dynamic GPU allocation, disregarding any autoscaling considerations:
Let's say we spin up a ClearML agent on an 8-GPU instance (via a launcher script, as @<1632913939241111552:profile|HighRaccoon77> is doing) with --dynamic-gpus enabled, catering to a 2-GPU queue and a 4-GPU queue. The agent pulls a new task that only requires 2 GPUs, and while that task is running, a new task that requires 4 GPUs is placed in the 4-GPU queue. Does the agent need to complete the 2-GPU task before launching the 4-GPU task, or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>
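For reference, this is roughly how I'd expect to launch it (a sketch only; the queue names are placeholders):
```bash
# Sketch: a single agent managing GPUs 0-7 and allocating them dynamically,
# pulling 2-GPU jobs from one queue and 4-GPU jobs from another.
clearml-agent daemon --dynamic-gpus --gpus 0-7 --queue 2gpu_queue=2 4gpu_queue=4
```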
Finally, is there any way of limiting the host memory that each task can use?
And you use the agent to set up the environment for the experiment to run?