Hi @<1632913939241111552:profile|HighRaccoon77> , the most 'basic' solution would be adding a piece of code at the end of your script to shut down the machine but obviously it would be unpleasant to run locally without Task.execute_remotely()
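A rough sketch of what that "piece of code at the end of your script" could look like, in case it helps. The helper names (`build_shutdown_command`, `shutdown_if_remote`) are made up for illustration, and the shutdown commands assume the process is allowed to power off the host. The `running_locally` flag is passed in by the caller (for example from `Task.running_locally()` in the `clearml` package), so the sketch itself has no ClearML dependency and does nothing on a local machine.

```python
import subprocess
import sys
from typing import List, Optional


def build_shutdown_command(platform: str = sys.platform) -> List[str]:
    """Return an OS-level shutdown command (assumes shutdown privileges)."""
    if platform.startswith("win"):
        return ["shutdown", "/s", "/t", "0"]
    return ["sudo", "shutdown", "-h", "now"]


def shutdown_if_remote(running_locally: bool, dry_run: bool = True) -> Optional[List[str]]:
    """Shut the machine down only when the script is NOT running locally.

    Returns the command that was (or would be) executed, or None when the
    script runs locally and nothing should happen.
    """
    if running_locally:
        return None
    cmd = build_shutdown_command()
    if not dry_run:
        # Hypothetical last step of the training script: power off the host.
        subprocess.run(cmd, check=False)
    return cmd


if __name__ == "__main__":
    # dry_run=True only prints the command instead of executing it.
    print(shutdown_if_remote(running_locally=False, dry_run=True))
```

Calling `shutdown_if_remote(running_locally, dry_run=False)` as the very last line of the script would be the "real" version; the dry-run default is there so nobody powers off a laptop by accident.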
Are you specifically using Sagemaker? Do you have any api interface you could work with to manipulate shutdown of machines?
It would be best if the autoscaler could support Sagemaker and a few other providers that have better on-demand GPU supplies.
Any specific reason not to use the autoscaler? I would imagine it would be even more cost effective
Glad to see it works (thanks for sharing @<1632913939241111552:profile|HighRaccoon77> ).
I have a question on dynamic GPU allocation, disregarding any autoscaling considerations:
Let’s say we spin up a ClearML agent on an 8 GPU instance (via a launcher script as @<1632913939241111552:profile|HighRaccoon77> is doing), with --dynamic-gpus enabled, catering to a 2 GPU queue and a 4 GPU queue. The agent pulls in a new task that only requires 2 GPUs, and while that task is ongoing, a new task that requires 4 GPUs is placed in the 4 GPU queue. Does the agent need to complete the first 2 GPU task before launching the 4 GPU task? Or can they run concurrently? @<1523701070390366208:profile|CostlyOstrich36>
And you use the agent to set up the environment for the experiment to run?
If not, would the right workaround be to launch, let’s say, 3 different agents from the same launcher script, 2 of them with access to 2 GPUs (agent1 - gpus 0,1, agent2 - gpus 2,3), and the other with access to 4 GPUs (agent3 - gpus 4,5,6,7)? Assuming I want to have more 2 GPU jobs running than 4 GPU jobs.
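For reference, the three-agent split could be sketched roughly like this from a Python launcher. The queue names `2gpu` and `4gpu` are placeholders, and the sketch only builds and prints the `clearml-agent daemon` command lines (using the `--queue`, `--gpus` and `--detached` options) rather than starting anything, so treat it as an illustration of the static partitioning idea, not a tested setup.

```python
from typing import List, Sequence, Tuple


def agent_command(queue: str, gpus: Sequence[int]) -> List[str]:
    """Build a clearml-agent daemon command pinned to specific GPU indices."""
    gpu_arg = ",".join(str(g) for g in gpus)
    return ["clearml-agent", "daemon", "--queue", queue, "--gpus", gpu_arg, "--detached"]


def build_launch_plan(partitions: Sequence[Tuple[str, Sequence[int]]]) -> List[List[str]]:
    """Return one command per agent, refusing overlapping GPU assignments."""
    seen = set()
    commands = []
    for queue, gpus in partitions:
        overlap = seen.intersection(gpus)
        if overlap:
            raise ValueError(f"GPUs assigned twice: {sorted(overlap)}")
        seen.update(gpus)
        commands.append(agent_command(queue, gpus))
    return commands


# Hypothetical queue names; GPU indices follow the split from the question.
PARTITIONS = [
    ("2gpu", (0, 1)),        # agent1
    ("2gpu", (2, 3)),        # agent2
    ("4gpu", (4, 5, 6, 7)),  # agent3
]

if __name__ == "__main__":
    for cmd in build_launch_plan(PARTITIONS):
        # A real launcher could pass each list to subprocess.run(cmd).
        print(" ".join(cmd))
```

The trade-off with this static split is that the 4 GPU partition sits idle when only 2 GPU jobs are queued, which is the gap dynamic allocation is meant to close.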
Finally, is there any way of limiting the host memory that each task can use?
Maybe even make a PR out of it if you want 🙂
How are you launching the agents?
@<1632913939241111552:profile|HighRaccoon77> were you able to make the instance stop after a job launched by the agent was complete?
Keeping machines up for a longer time at a comparatively lower cost (especially if you're using spot instances)
I guess you could probably introduce some code into the clearml agent as a configuration in clearml.conf
or even as a flag in the CLI that would send a shutdown command to the machine once the agent finishes running a job
Sagemaker runs a `train.sh` script I provide. I just put `clearml-agent` commands in that script.
It's more difficult to get p4de quota / capacity from EC2 than Sagemaker.
With the autoscaler it's also easier to configure a large variety of different compute resources. Although if you're only interested in p4-equivalent instances that are available quickly on demand, I can understand the issue
BTW, considering the lower costs of EC2, you could always use longer timeout times for the autoscaler to ensure better availability of machines
The machine should shut down automatically once clearml-agent exits.
And easier to manage without the need for such 'hacks' 😛
EC2 is indeed cheaper than Sagemaker tho, and it's supported by the autoscaler.