Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is to deploy two different agents in k8s using the helm chart. Simply try installing two different releases, then modify only one of them to have basePodTemplate use the nvidia.com/gpu: "4" resource.
Let me know if this solves your issue 🙂
@<1729671499981262848:profile|CooperativeKitten94> thank you! I will try and will update : ))
Hey @<1726047624538099712:profile|WorriedSwan6> , the basePodTemplate section configures the default base template for all pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and add a section like this in the values instead:
agentk8sglue:
  createQueues: true
  queues:
    myQueueWith4GPUs:
      templateOverrides:
        resources:
          requests:
            nvidia.com/gpu: "4"
          limits:
            nvidia.com/gpu: "4"
When you want a Task to use the 4 GPU slices, you simply need to enqueue it on this myQueueWith4GPUs queue; otherwise it won't have the nvidia.com/gpu: "4" resources.
I must add that I do not see anything in the Helm chart for using templateOverrides
Hey @<1729671499981262848:profile|CooperativeKitten94> yes, it did! 🙂
Thank you for the support.
Hi @<1726047624538099712:profile|WorriedSwan6> , ideally the pipeline controller would be running on the services agent which is part of the server deployment and does not require GPU resources at all
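As a sketch only (this is not necessarily how the bundled services agent is configured, and the release name is hypothetical), one way to keep pipeline-controller pods GPU-free is a dedicated agent release that consumes only the services queue and requests no GPUs in its basePodTemplate:
# Hypothetical second agent release, e.g. "agent-services", serving only the services queue
agentk8sglue:
  queue: services
  basePodTemplate:
    resources: {}   # no nvidia.com/gpu requests/limits, so controller pods stay CPU-only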
@<1729671499981262848:profile|CooperativeKitten94> Running the following conf:
queue:
  services-tasks:
    templateOverrides:
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
  services:
    templateOverrides:
      resources:
        requests:
          nvidia.com/gpu: "0"
        limits:
          nvidia.com/gpu: "0"
apiServerUrlReference: ""
fileServerUrlReference: ""
webServerUrlReference: ""
basePodTemplate:
  resources:
    requests:
      nvidia.com/gpu: "2"
    limits:
      nvidia.com/gpu: "2"
causes the agent pod to go into a CrashLoopBackOff:
python3 k8s_glue_example.py --queue 'map[services:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:0]' 'requests:map[nvidia.com/gpu:0]]]]' 'services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1]' 'requests:map[nvidia.com/gpu:1]]]]]' --max-pods 2 --namespace clearml --template-yaml /root/template/template.yaml
/usr/local/lib/python3.6/dist-packages/jwt/utils.py:7: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
from cryptography.hazmat.primitives.asymmetric.ec import EllipticCurve
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode]
[--num-of-services NUM_OF_SERVICES]
[--base-port BASE_PORT]
[--base-pod-num BASE_POD_NUM]
[--gateway-address GATEWAY_ADDRESS]
[--pod-clearml-conf POD_CLEARML_CONF]
[--overrides-yaml OVERRIDES_YAML]
[--template-yaml TEMPLATE_YAML]
[--ssh-server-port SSH_SERVER_PORT]
[--namespace NAMESPACE] [--max-pods MAX_PODS]
[--use-owner-token] [--standalone-mode]
[--child-report-tags CHILD_REPORT_TAGS [CHILD_REPORT_TAGS ...]]
k8s_glue_example.py: error: unrecognized arguments: requests:map[nvidia.com/gpu:0]]]] services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1] requests:map[nvidia.com/gpu:1]]]]]
Hey @<1523701070390366208:profile|CostlyOstrich36>
Can you explain this point a bit more?
In the Helm chart of the agent I configure it like so:
...
agentk8sglue:
  extraEnvs:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-access-key-id
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-secret-access-key
          key: AWS_SECRET_ACCESS_KEY
    - name: K8S_GLUE_MAX_PODS
      value: '1'
    - name: CLEARML_AGENT_GIT_USER
      value: clearml-agent
    - name: CLEARML_AGENT_GIT_PASS
      valueFrom:
        secretKeyRef:
          name: clearml-repo-read
          key: token
  nodeSelector: {}
  defaultContainerImage: <some_image>
  queue: services
  apiServerUrlReference: ""
  fileServerUrlReference: ""
  webServerUrlReference: ""
  basePodTemplate:
    resources:
      requests:
        nvidia.com/gpu: "4"
      limits:
        nvidia.com/gpu: "4"
So for every step the agent spins up a pod, and every pod has access to 4 replicas of the GPU, which is OK... but how do I prevent the controller from getting that access?
And as it turns out, I cannot specify multiple queues:
# -- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3
queue: default
but doing so gives an error.