this seems to be confirmed by this documentation: "If you have not changed the default runtime on your GPU nodes, you must explicitly request the NVIDIA runtime by setting `runtimeClassName: nvidia` in the Pod spec."
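For context, a minimal sketch of what that docs passage describes, assuming the `nvidia` RuntimeClass already exists on the cluster (the Pod name and image below are placeholders, not from the original thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # hypothetical name, for illustration only
spec:
  runtimeClassName: nvidia        # explicitly request the NVIDIA container runtime
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04  # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1       # request one GPU from the device plugin
```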
Hello AntsyElk37 🙂
You are right, the `spec.runtimeClassName` field is not supported in the Agent at the moment; I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtime class to use GPUs?
Usually, you'd need to specify a Pod's container with, for example, `resources.limits.nvidia.com/gpu: 1`, and the NVIDIA Device Plugin would itself assign the correct device to the container. Will that work?
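In Pod-spec terms, that suggestion would look roughly like the fragment below, without any `runtimeClassName`; the container name and image are placeholders I'm assuming for illustration:

```yaml
containers:
  - name: trainer                 # hypothetical container name
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04  # example image
    resources:
      limits:
        nvidia.com/gpu: 1         # the device plugin schedules and exposes the GPU
```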
I'm still trying to understand why it was needed in our case. I have the NVIDIA GPU Operator installed with mostly the default values on our on-prem cluster. I found there is an option, CONTAINERD_SET_AS_DEFAULT, in the operator which, when enabled, makes the NVIDIA runtime the default for all Pods. We didn't enable that option; maybe if we had enabled it, it would have worked.
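If I remember the GPU Operator Helm chart correctly, that option is passed to the container toolkit via an env entry, something like the sketch below (the exact values structure may differ between chart versions, so treat this as an assumption to verify against your chart):

```yaml
# values.yaml fragment for the NVIDIA GPU Operator Helm chart (sketch,
# assuming the option is exposed under toolkit.env)
toolkit:
  env:
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"   # make the NVIDIA runtime the containerd default for all Pods
```

With that default in place, Pods wouldn't need `runtimeClassName: nvidia` at all, which would explain why only clusters without it have to set the field explicitly.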