Does ClearML support running experiments on any "serverless" environments (e.g. VertexAI, SageMaker, etc.), such that GPU resources are allocated on demand?
Alternatively, is there a story for auto-scaling GPU machines based on experiments waiting in
By "serverless" I mean running a training task on a cloud service that provisions machines with GPUs for that task on demand.
That means we don't have to keep a pool of GPU machines standing by, and we don't have to deal with autoscaling. The cloud provider, upon receiving such a training task, provisions the machines and runs the training.
This is a common use case in VertexAI, for example.
Regarding autoscaling: yes, I mean autoscaling EC2 instances, for example, based on pending experiments in the ClearML experiments queue.
Even better if you can autoscale (create and stop) EKS instances.
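
To make the queue-based autoscaling idea concrete, here is a minimal sketch of the decision logic I have in mind. It is only an illustration: `plan_scaling` and `ScalingDecision` are hypothetical names, it assumes one GPU instance per pending experiment, and it does not call any ClearML or AWS API. In practice the inputs would come from the experiments queue and the cloud provider.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    start: int  # number of new GPU instances to launch
    stop: int   # number of idle GPU instances to shut down


def plan_scaling(pending_tasks: int, running_instances: int,
                 idle_instances: int, max_instances: int) -> ScalingDecision:
    """Hypothetical helper: decide how many instances to start or stop.

    Assumes one instance serves one experiment at a time, and that
    running_instances already includes the idle ones.
    """
    # Idle instances pick up pending experiments before anything new is launched.
    unserved = max(pending_tasks - idle_instances, 0)
    # Never exceed the configured budget of instances.
    headroom = max(max_instances - running_instances, 0)
    start = min(unserved, headroom)
    # Idle instances with no experiment waiting for them can be stopped.
    stop = max(idle_instances - pending_tasks, 0)
    return ScalingDecision(start=start, stop=stop)


if __name__ == "__main__":
    # Three experiments waiting, one busy instance, budget of four instances.
    print(plan_scaling(pending_tasks=3, running_instances=1,
                       idle_instances=0, max_instances=4))
```

With an empty queue and two idle instances, the same function would suggest stopping both, which is the "create and stop" behaviour described above.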
2 years ago