Re. "serverless": I mean running a training task on a cloud service such that the GPU machines for that task are provisioned on demand.
That means we don't have to keep a pool of machines with GPUs standing by, and don't have to deal with autoscaling. The cloud provider, upon receipt of such a training task, provisions the machines and runs the training.
This is a common use case, for example in Vertex AI.
Regarding autoscaling: yes, for example autoscaling EC2 instances based on pending experiments in the ClearML experiments queue.
Even better would be the ability to autoscale (create and stop) EKS instances.
Hi IcyJellyfish61, while spinning EKS up and down is not supported (albeit very cool 😄), we have an autoscaler in the applications section that does exactly what you need: it spins EC2 instances up and down according to demand 🙂
If you're using http://app.clear.ml as your server, you can find it at https://app.clear.ml/applications .
Unfortunately, it is not available for the open-source server, only on paid tiers.
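To make the "according to demand" idea concrete, here is a minimal Python sketch of a queue-driven scaling decision. It is not ClearML's autoscaler code and does not call any ClearML or AWS API; `QueuePolicy`, `plan_scaling`, and the instance limit are hypothetical names chosen for illustration, and the rule (one machine per waiting experiment, capped by a maximum) is just one simple policy you might pick.

```python
from dataclasses import dataclass


@dataclass
class QueuePolicy:
    """Hypothetical per-queue scaling limits (illustrative only)."""
    instance_type: str   # e.g. an EC2 GPU instance type
    max_instances: int   # upper bound on machines serving this queue


def plan_scaling(pending: int, running: int, idle: int, policy: QueuePolicy) -> int:
    """Return how many instances to start (positive) or stop (negative).

    pending: experiments waiting in the queue
    running: instances currently up for this queue (busy + idle)
    idle:    instances among `running` with no experiment assigned
    """
    if pending > idle:
        # More waiting experiments than free machines: start one machine
        # per unserved experiment, capped by the policy limit.
        wanted = pending - idle
        headroom = max(policy.max_instances - running, 0)
        return min(wanted, headroom)
    # Enough free machines: stop the idle ones the pending work won't use.
    return -(idle - pending)


# Assumed example: one GPU queue limited to four machines.
policy = QueuePolicy(instance_type="g4dn.xlarge", max_instances=4)
print(plan_scaling(pending=3, running=1, idle=0, policy=policy))
print(plan_scaling(pending=0, running=2, idle=2, policy=policy))
```

With one policy per queue, the same function can be evaluated for each queue on a polling interval; a real autoscaler would also need an idle timeout before stopping machines, which this sketch leaves out.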
Does ClearML support running the experiments on any "serverless" environments
Can you please elaborate on what you mean by "serverless"?
such that GPU resources are allocated on demand?
You can define various queues for resources according to whatever structure you want. Does that make sense?
Alternatively, is there a story for auto-scaling GPU machines based on experiments waiting in the queue and some policy?
Do you mean an autoscaler for AWS, for example?