That should not be complicated to implement. Basically you could run 'clearml-agent execute --id <task_id>' as the SageMaker command. Can you manually launch it on SageMaker?
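Something along these lines, perhaps - a rough, untested sketch using boto3's create_training_job, where the image, role ARN, bucket, instance type, and credentials are all placeholders, and the image is assumed to already have clearml-agent installed:

```python
import boto3

sm = boto3.client("sagemaker")
task_id = "abc123"  # placeholder: the ClearML task to execute

sm.create_training_job(
    TrainingJobName=f"clearml-task-{task_id}",
    AlgorithmSpecification={
        # Placeholder image, assumed to have clearml-agent preinstalled
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/clearml-agent:latest",
        "TrainingInputMode": "File",
        # Run the agent against the existing task instead of a training script
        "ContainerEntrypoint": ["clearml-agent"],
        "ContainerArguments": ["execute", "--id", task_id],
    },
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    ResourceConfig={
        "InstanceType": "ml.g4dn.xlarge",  # pick whatever CPU/GPU instance the task needs
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/clearml-sagemaker-output/"},
    StoppingCondition={"MaxRuntimeInSeconds": 12 * 3600},
    Environment={
        # ClearML credentials for the agent inside the container (placeholders)
        "CLEARML_API_HOST": "https://api.clear.ml",
        "CLEARML_API_ACCESS_KEY": "<access-key>",
        "CLEARML_API_SECRET_KEY": "<secret-key>",
    },
)
```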
SageMaker would make that easy, especially if I have SageMaker as the long-tail choice, granted at a higher cost.
I think my main point is: the k8s glue on AKS or GKE basically takes care of spinning up new nodes, since the managed k8s service does that. The AWS autoscaler is kind of a replacement for that. Makes sense?
For different workloads, I need to have different cluster-autoscaler rules and account for different GPU needs.
AgitatedDove14 - I had not used the autoscaler since it asks for an access key. Mainly looking at GPU use cases - with SageMaker one can choose any instance they want and use it, whereas the autoscaler would need a preset instance type configured, right? Need to revisit. Also, I want to use the k8s glue if not for this. Suggestions?
As in, if there are jobs, the first level is new pods, and the second level is new nodes in the cluster.
BTW, is it cheaper than an EC2 instance? Why not use the AWS autoscaler?
Got it. I have never run a GPU workload on EKS before. Do you have any experience and things to watch out for?
Running multiple k8s_daemon instances, right? k8s_daemon("1xGPU")
and k8s_daemon('cpu')
right?
The AWS autoscaler will work with IAM roles as long as you have them configured on the machine itself. For SageMaker job scheduling (I'm assuming this is what you are referring to, and not the notebook), you need to select the instance type as well (basically the same as EC2). What do you mean by using the k8s glue - like inherit it and implement the same mechanism, but for SageMaker instead of kubectl?
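If that is the direction, a very rough sketch of the idea - the queue-polling helper is hypothetical pseudocode (the real k8s glue pulls pending tasks through the ClearML backend API), and launch_on_sagemaker() stands for the create_training_job call sketched earlier:

```python
import time


def get_next_task_from_queue(queue_name: str):
    """Hypothetical helper: pop the next pending task id from a ClearML queue.

    In the real k8s glue this is done through the ClearML backend API.
    """
    raise NotImplementedError


def launch_on_sagemaker(task_id: str):
    """Stand-in for the boto3 create_training_job call sketched earlier,
    i.e. a SageMaker job running `clearml-agent execute --id <task_id>`."""
    raise NotImplementedError


def sagemaker_glue_daemon(queue_name: str, poll_interval_sec: int = 15):
    # Same mechanism as the k8s glue, but launching SageMaker jobs instead of
    # kubectl-created pods: poll a ClearML queue and hand each task to SageMaker.
    while True:
        task_id = get_next_task_from_queue(queue_name)
        if task_id:
            launch_on_sagemaker(task_id)
        else:
            time.sleep(poll_interval_sec)
```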
Do you have any experience and things to watch out for?
Yes, for testing start with cheap node instances 🙂
If I remember correctly, everything is preconfigured to support GPU instances (aka the NVIDIA runtime).
You can take one of the templates from here as a starting point:
https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
AgitatedDove14 - any pointers on how to run GPU tasks with the k8s glue? How to control the queue and differentiate tasks that need CPU vs GPU in this context?
Basic setup:
a glue service per "job template" (i.e. per set of k8s resources, for example a CPU requirement or a GPU requirement),
and a queue per glue service, e.g. a cpu_machine queue and a 1xGPU queue.
wdyt?
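Roughly like this, with one glue process per queue - a minimal sketch based on the K8sIntegration glue shipped with clearml-agent; the template_yaml argument and the pod template paths are assumptions, so check them against your clearml-agent version:

```python
from clearml_agent.glue.k8s import K8sIntegration

# GPU glue service: serves the "1xGPU" queue using a pod template that requests
# a single GPU (the yaml path is a placeholder).
gpu_glue = K8sIntegration(template_yaml="templates/pod_1xgpu.yaml")
gpu_glue.k8s_daemon("1xGPU")  # blocking loop: pulls tasks enqueued on "1xGPU"

# Run a second process the same way for the CPU queue, e.g.:
#   cpu_glue = K8sIntegration(template_yaml="templates/pod_cpu.yaml")
#   cpu_glue.k8s_daemon("cpu_machine")
```

Then each task just gets enqueued on the queue that matches its resource needs, e.g. Task.execute_remotely(queue_name='1xGPU').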
AgitatedDove14 the AWS autoscaler is not k8s-native, right? That's sort of the loose point I am getting at.