ClearML FAQ | Anyone Doing Sagemaker With Clearml - Something Like The K8S Glue But The Tasks Are Pulled Into Sagemaker Training Jobs

Answered

Is anyone using SageMaker with ClearML? I'm thinking of something like the k8s glue, but with the tasks pulled into SageMaker training jobs.

  				
Posted 3 years ago by TrickySheep9

Answers 15

SageMaker would make that easy, especially if I keep SageMaker as the long-tail choice, granted at a higher cost.

  				
Posted 3 years ago by TrickySheep9

AgitatedDove14, the AWS autoscaler is not k8s-native, right? That is loosely the point I am getting at.

  				
Posted 3 years ago by TrickySheep9

That should not be complicated to implement. Basically, you could run 'clearml-agent execute --id <task_id>' as the SageMaker command. Can you manually launch it on SageMaker?

  				
Posted 3 years ago by AgitatedDove14
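
As a rough illustration of the suggestion above, here is a minimal sketch that builds the request for a SageMaker training job whose container runs a single ClearML task. It assumes the intended command is `clearml-agent execute --id <task_id>`. The image URI, role ARN, bucket and instance type are hypothetical placeholders, and the dictionary keys follow the boto3 `create_training_job` parameters as I understand them, so check them against the current boto3 documentation before relying on this.

```python
from typing import Any, Dict, List


def build_agent_command(task_id: str) -> List[str]:
    """Command the container runs to execute one ClearML task."""
    return ["clearml-agent", "execute", "--id", task_id]


def build_training_job_request(
    task_id: str,
    image_uri: str,
    role_arn: str,
    instance_type: str,
    output_s3_uri: str,
) -> Dict[str, Any]:
    """Build (but do not send) a create_training_job-style request.

    All argument values are supplied by the caller; nothing here is a
    ClearML or SageMaker default.
    """
    command = build_agent_command(task_id)
    return {
        # Hypothetical naming scheme: one training job per ClearML task.
        "TrainingJobName": "clearml-" + task_id[:16],
        "AlgorithmSpecification": {
            # Assumed: an image with clearml-agent installed and configured.
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
            "ContainerEntrypoint": command[:1],
            "ContainerArguments": command[1:],
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 3600},
    }


request = build_training_job_request(
    task_id="0123456789abcdef0123456789abcdef",           # placeholder task ID
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/clearml-agent:latest",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
    instance_type="ml.g4dn.xlarge",
    output_s3_uri="s3://<bucket>/clearml-sagemaker/",
)
print(request["AlgorithmSpecification"]["ContainerArguments"])
```

With AWS credentials in place, the request could then be submitted with `boto3.client("sagemaker").create_training_job(**request)`. The container would also need the ClearML server credentials, for example through the `CLEARML_API_HOST`, `CLEARML_API_ACCESS_KEY` and `CLEARML_API_SECRET_KEY` environment variables.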

For different workloads, I need to have different cluster scaler rules and account for different GPU needs.

  				
Posted 3 years ago by TrickySheep9

"Do you have any experience and things to watch out for?"

Yes, for testing, start with cheap node instances 🙂
If I remember correctly, everything is preconfigured to support GPU instances (i.e. the NVIDIA runtime).
You can take one of the templates from here as a starting point:
https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/

  				
Posted 3 years ago by AgitatedDove14

Basic setup:
One glue service per "job template" (i.e. a set of k8s resources, for example a CPU requirement or a GPU requirement).
One queue per glue service, e.g. a cpu_machine queue and a 1xGPU queue.
wdyt?

  				
Posted 3 years ago by AgitatedDove14
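
The basic setup above (one job template per queue) can be sketched roughly as follows. The queue names come from the post; the resource values, the container name and the helper functions are hypothetical, and how the glue service actually consumes a pod template is not shown here. `nvidia.com/gpu` is the standard Kubernetes resource name exposed by the NVIDIA device plugin.

```python
from copy import deepcopy
from typing import Any, Dict

# One resource spec per queue; the numbers are illustrative assumptions.
QUEUE_RESOURCES: Dict[str, Dict[str, Any]] = {
    "cpu_machine": {"requests": {"cpu": "2", "memory": "4Gi"}},
    "1xGPU": {"limits": {"nvidia.com/gpu": 1}},
}


def pod_template_for(queue: str) -> Dict[str, Any]:
    """Return a minimal pod template carrying the resources for a queue."""
    if queue not in QUEUE_RESOURCES:
        raise KeyError("no job template defined for queue: " + queue)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "clearml-task",  # hypothetical container name
                    "resources": deepcopy(QUEUE_RESOURCES[queue]),
                }
            ],
        },
    }


def queue_for(needs_gpu: bool) -> str:
    """Pick the queue, and therefore the job template, for a task."""
    return "1xGPU" if needs_gpu else "cpu_machine"


print(queue_for(needs_gpu=True), pod_template_for("1xGPU")["spec"]["containers"][0]["resources"])
```

With this layout, choosing CPU versus GPU comes down to which queue a task is enqueued into, for example `Task.enqueue(task, queue_name=queue_for(needs_gpu=True))` with the ClearML SDK.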

Got it. I have never run a GPU workload on EKS before. Do you have any experience, and anything to watch out for?

  				
Posted 3 years ago by TrickySheep9

BTW, is it cheaper than an EC2 instance? Why not use the AWS autoscaler?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 - I have not used the autoscaler, since it asks for an access key. I am mainly looking at GPU use cases: with SageMaker one can choose any instance type and use it, whereas the autoscaler would need a fixed instance type configured, right? I need to revisit that. Also, I want to use the k8s glue, if not for this. Suggestions?

  				
Posted 3 years ago by TrickySheep9

AgitatedDove14 - any pointers on how to run GPU tasks with the k8s glue? How do I control the queues and differentiate tasks that need CPU vs. GPU in this context?

  				
Posted 3 years ago by TrickySheep9

Would this be a good use case to have?

  				
Posted 3 years ago by TrickySheep9

As in, if there are jobs, the first level of scaling is new pods and the second level is new nodes in the cluster.

  				
Posted 3 years ago by TrickySheep9
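
The two-level idea above (new pods first, then new nodes) can be illustrated with a toy calculation. This is only a sketch of the arithmetic, not how the Kubernetes cluster autoscaler or the k8s glue is implemented; the function name and the notion of a fixed number of job "slots" per node are assumptions made for the example.

```python
import math
from typing import Dict


def plan_scaling(pending_jobs: int, free_slots: int, slots_per_node: int) -> Dict[str, int]:
    """Level 1: start pods on free capacity. Level 2: add nodes for the rest."""
    if pending_jobs < 0 or free_slots < 0 or slots_per_node <= 0:
        raise ValueError("counts must be non-negative and slots_per_node positive")
    pods_now = min(pending_jobs, free_slots)        # fits on existing nodes
    overflow = pending_jobs - pods_now              # needs new capacity
    nodes_to_add = math.ceil(overflow / slots_per_node)
    return {
        "pods_now": pods_now,
        "pods_after_scale_up": overflow,
        "nodes_to_add": nodes_to_add,
    }


# 5 queued jobs, 2 free GPU slots, 2 GPU slots per new node
print(plan_scaling(pending_jobs=5, free_slots=2, slots_per_node=2))
# → {'pods_now': 2, 'pods_after_scale_up': 3, 'nodes_to_add': 2}
```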

So that means running multiple k8s_daemon instances, right? k8s_daemon("1xGPU") and k8s_daemon("cpu")?

  				
Posted 3 years ago by TrickySheep9
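
Since each glue daemon serves one queue, running one per queue might look like the sketch below, which only builds the command lines for separate processes and does not start them. The queue names `1xGPU` and `cpu` come from the question above. The script name refers to the `k8s_glue_example.py` example shipped in the clearml-agent repository; its path, the template file names and the `--queue` / `--template-yaml` flags are assumptions based on that example and should be checked against your clearml-agent version.

```python
from typing import Dict, List

# Hypothetical mapping: queue name -> pod template file for that queue.
QUEUE_TEMPLATES: Dict[str, str] = {
    "1xGPU": "templates/gpu_pod.yaml",
    "cpu": "templates/cpu_pod.yaml",
}


def glue_commands(script: str = "k8s_glue_example.py") -> List[List[str]]:
    """One glue process per queue, each with its own pod template."""
    return [
        ["python", script, "--queue", queue, "--template-yaml", template]
        for queue, template in QUEUE_TEMPLATES.items()
    ]


for cmd in glue_commands():
    # In a real deployment each command would run as its own long-lived
    # process (for example, one Kubernetes Deployment per queue).
    print(" ".join(cmd))
```

Separate processes are used in this sketch because a daemon of this kind is expected to block while it polls its queue.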

The AWS autoscaler will work with IAM roles as long as you have them configured on the machine itself. With SageMaker job scheduling (I'm assuming this is what you are referring to, and not the notebook), you need to select the instance as well (basically the same as EC2). What do you mean by using the k8s glue? Do you mean inheriting and implementing the same mechanism, but for SageMaker instead of kubectl?

  				
Posted 3 years ago by AgitatedDove14

I think my main point is that the k8s glue on AKS or GKE basically takes care of spinning up new nodes, since the k8s service does that. The AWS autoscaler is kind of a replacement for that. Does that make sense?

  				
Posted 3 years ago by AgitatedDove14
