Answered
Hello! I had trouble running clearml-agent on K8s. I fixed it by modifying the Helm chart to allow specifying runtimeClassName (which is needed when using the NVIDIA GPU Operator).

Hello! I had trouble running clearml-agent on K8s. I fixed it by modifying the Helm chart to allow specifying runtimeClassName (which is needed when using the NVIDIA GPU Operator). I did this: None. It's trivial. Should I do anything more than this? Is anybody else running the ClearML Agent on a Kubernetes cluster with the NVIDIA GPU Operator?
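For reference, the change is roughly the following sketch (the values key name and template placement are illustrative only, not the chart's actual structure):

    # values.yaml (hypothetical key, for illustration only)
    taskPod:
      runtimeClassName: nvidia

    # pod template snippet (sketch): render the field only when a value is set
    spec:
      {{- with .Values.taskPod.runtimeClassName }}
      runtimeClassName: {{ . }}
      {{- end }}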

  
  
Posted 5 months ago

Answers 6


Hello @<1523708147405950976:profile|AntsyElk37> 🙂
You are right, the spec.runtimeClassName field is not supported in the Agent at the moment; I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtimeClass in order to use GPUs?
Usually, you'd only need to set a resource limit on the Pod's container, for example resources.limits with nvidia.com/gpu: 1, and the NVIDIA Device Plugin would itself assign the correct device to the container. Will that work?
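Something like this minimal sketch (pod name and image are illustrative, not taken from the ClearML chart):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-task                 # illustrative name
    spec:
      containers:
        - name: task
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # any CUDA-enabled image
          resources:
            limits:
              nvidia.com/gpu: 1      # the NVIDIA Device Plugin assigns a GPU to this container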

  
  
Posted 5 months ago

This seems to be confirmed by this documentation: None. "If you have not changed the default runtime on your GPU nodes, you must explicitly request the NVIDIA runtime by setting runtimeClassName: nvidia in the Pod spec."
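In practice that looks roughly like this (a sketch, assuming the GPU Operator created a RuntimeClass named nvidia):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-task                 # illustrative name
    spec:
      runtimeClassName: nvidia       # explicitly request the NVIDIA container runtime
      containers:
        - name: task
          image: nvidia/cuda:12.2.0-base-ubuntu22.04
          resources:
            limits:
              nvidia.com/gpu: 1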

  
  
Posted 5 months ago

Hi @<1729671499981262848:profile|CooperativeKitten94>, did I convince you with my argument? Do you think having runtimeClass configurable is worth it?

  
  
Posted 5 months ago

I'm still trying to understand why it was needed in our case. I have the NVIDIA GPU Operator installed with mostly default values on our on-prem cluster. I found there is an option, CONTAINERD_SET_AS_DEFAULT, in the operator which, when enabled, makes the NVIDIA runtime the default for all Pods. We didn't enable that option; maybe if we had, it would have worked without setting runtimeClassName.
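If I read the GPU Operator docs correctly, that option is set through the toolkit environment in the operator's Helm values, roughly like this (a sketch; verify against your operator/chart version):

    # gpu-operator Helm values (sketch)
    toolkit:
      env:
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"              # make the NVIDIA runtime the default containerd runtime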

  
  
Posted 5 months ago

Hi @<1523708147405950976:profile|AntsyElk37> - There are a few points missing for the PR to be completed; let's follow up on GitHub. See my comments here: None

  
  
Posted 5 months ago

Hi @<1523708147405950976:profile|AntsyElk37> - Yes, having the runtimeClass configurable makes sense. I'll be handling your PR soon 🙂

  
  
Posted 5 months ago