Unanswered
Hi team! Is there a way to make ClearML's AWS autoscaler and queues resource-aware please? I.e. if we can say, as we enqueue our job, how much RAM or GPU RAM or even GPUs it needs, have the scheduler/autoscaler dispatch the job to instances that are of th
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would:
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get a high volume of events, you'd either be autoscaling until you were broke (one EC2 instance per streaming event :shocked_face_with_exploding_head: ), or your queue would have an impossibly long wait time. Automatic retraining would have similar problems at high volume, though retraining a model is probably a much lower-volume type of work than streaming.
That said, credit where credit is due: it's pretty amazing that ClearML allows you to orchestrate compute in a self-hosted manner without needing to have Kubernetes expertise on your team.
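To make the packing point concrete, here is a rough sketch of what resource-aware placement could look like. This is not ClearML functionality and uses no ClearML API: `Job`, `Instance`, and `pack` are hypothetical names, and the instance size (16 GB RAM, 1 GPU) is an assumed figure chosen only for illustration. It uses a simple first-fit-decreasing heuristic, which is one of several possible packing strategies.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job with declared resource needs."""
    name: str
    ram_gb: float
    gpus: int = 0


@dataclass
class Instance:
    """Hypothetical instance with fixed capacity and the jobs placed on it."""
    ram_gb: float
    gpus: int
    jobs: list = field(default_factory=list)

    def fits(self, job: Job) -> bool:
        # Compare the job's needs against what is still free on this instance.
        free_ram = self.ram_gb - sum(j.ram_gb for j in self.jobs)
        free_gpus = self.gpus - sum(j.gpus for j in self.jobs)
        return job.ram_gb <= free_ram and job.gpus <= free_gpus


def pack(jobs, ram_gb, gpus):
    """First-fit decreasing: place each job on the first instance with room,
    starting a new instance only when none has enough free capacity."""
    instances = []
    for job in sorted(jobs, key=lambda j: (j.gpus, j.ram_gb), reverse=True):
        if job.ram_gb > ram_gb or job.gpus > gpus:
            raise ValueError(f"{job.name} does not fit on any instance")
        target = next((i for i in instances if i.fits(job)), None)
        if target is None:
            target = Instance(ram_gb, gpus)
            instances.append(target)
        target.jobs.append(job)
    return instances


# Eight small streaming-inference events, 4 GB of RAM each, no GPU.
events = [Job(f"event-{n}", ram_gb=4) for n in range(8)]
packed = pack(events, ram_gb=16, gpus=1)
print(f"one instance per job: {len(events)}, packed: {len(packed)}")
```

With these assumed numbers, four 4 GB jobs share each 16 GB instance, so the eight events need two instances rather than eight.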
one year ago