Answered
Hey, How Can We Control The Pod Of PipelineController Not To Use The GPU?

Hey, how can we control the pod of the PipelineController so that it does not use the GPU? According to the documentation, the pipeline controller lives as long as the pipeline itself is being executed.

So if I am deploying an agent on k8s via the Helm chart and give the agent the GPU resource:

  basePodTemplate:
    resources: 
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"

the controller pod will also get that resource, which wastes a GPU time-slice.
Any help?

  
  
Posted 3 months ago

Answers 9


Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is to deploy two different agents in k8s using the Helm chart. Simply try installing two different releases, then modify only one of them to have basePodTemplate use nvidia.com/gpu: "4".
Let me know if this solves your issue 🙂

  
  
Posted 3 months ago

@<1729671499981262848:profile|CooperativeKitten94> thank you! I will try it and update :))

  
  
Posted 3 months ago

Hey @<1726047624538099712:profile|WorriedSwan6>, the basePodTemplate section configures the default base template for all pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and adding a section like this in the values instead:

agentk8sglue:
  createQueues: true
  queues:
    myQueueWith4GPUs:
      templateOverrides:
        resources: 
          requests:
            nvidia.com/gpu: "4"
          limits:
            nvidia.com/gpu: "4"

When you want a Task to use the 4 GPU slices, simply enqueue it on this myQueueWith4GPUs queue; otherwise it won't get the nvidia.com/gpu: "4" resources.
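
For example, a minimal sketch from the Python SDK (the project and task names are illustrative; the queue name matches the one defined above):

from clearml import Task

# Placeholder project/task names; only the queue name matters here.
task = Task.init(project_name="my_project", task_name="gpu_training")

# Stop local execution and re-enqueue this task on the 4-GPU-slice queue,
# so the spawned pod gets the nvidia.com/gpu: "4" override.
task.execute_remotely(queue_name="myQueueWith4GPUs")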

  
  
Posted 3 months ago

I must add that I do not see anything in the Helm chart for using templateOverrides.

  
  
Posted 3 months ago

Hey @<1729671499981262848:profile|CooperativeKitten94> yes, it did! 🙂
Thank you for the support.

  
  
Posted 3 months ago

Hi @<1726047624538099712:profile|WorriedSwan6>, ideally the pipeline controller would be running on the services agent, which is part of the server deployment and does not require GPU resources at all.
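
A minimal sketch of that split (pipeline, project, task, and GPU queue names are illustrative; "services" is the controller's default queue):

from clearml import PipelineController

pipe = PipelineController(name="my_pipeline", project="my_project", version="1.0.0")

# GPU-hungry steps go to a queue whose pod template requests the GPU...
pipe.add_step(
    name="train",
    base_task_project="my_project",
    base_task_name="gpu_training",
    execution_queue="myQueueWith4GPUs",
)

# ...while the controller itself runs on the CPU-only services queue.
pipe.start(queue="services")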

  
  
Posted 3 months ago

@<1729671499981262848:profile|CooperativeKitten94> Running the following conf:

  queue:
    services-tasks:
      templateOverrides:
        resources: 
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
    services: 
      templateOverrides:
        resources: 
          requests:
            nvidia.com/gpu: "0"
          limits:
            nvidia.com/gpu: "0"


  apiServerUrlReference: ""
  fileServerUrlReference: ""
  webServerUrlReference: ""
  basePodTemplate:
    resources: 
      requests:
        nvidia.com/gpu: "2"
      limits:
        nvidia.com/gpu: "2"

causes the agent pod to go into a CrashLoopBackOff:

python3 k8s_glue_example.py --queue 'map[services:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:0]' 'requests:map[nvidia.com/gpu:0]]]]' 'services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1]' 'requests:map[nvidia.com/gpu:1]]]]]' --max-pods 2 --namespace clearml --template-yaml /root/template/template.yaml
/usr/local/lib/python3.6/dist-packages/jwt/utils.py:7: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
  from cryptography.hazmat.primitives.asymmetric.ec import EllipticCurve
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode]
                           [--num-of-services NUM_OF_SERVICES]
                           [--base-port BASE_PORT]
                           [--base-pod-num BASE_POD_NUM]
                           [--gateway-address GATEWAY_ADDRESS]
                           [--pod-clearml-conf POD_CLEARML_CONF]
                           [--overrides-yaml OVERRIDES_YAML]
                           [--template-yaml TEMPLATE_YAML]
                           [--ssh-server-port SSH_SERVER_PORT]
                           [--namespace NAMESPACE] [--max-pods MAX_PODS]
                           [--use-owner-token] [--standalone-mode]
                           [--child-report-tags CHILD_REPORT_TAGS [CHILD_REPORT_TAGS ...]]
k8s_glue_example.py: error: unrecognized arguments: requests:map[nvidia.com/gpu:0]]]] services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1] requests:map[nvidia.com/gpu:1]]]]]
  
  
Posted 3 months ago

Hey @<1523701070390366208:profile|CostlyOstrich36>
Can you explain this point a bit more?
In the Helm chart of the agent I configure it like so:
...

agentk8sglue:
  extraEnvs: 
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-access-key-id
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-secret-access-key
        key: AWS_SECRET_ACCESS_KEY
  - name: K8S_GLUE_MAX_PODS
    value: '1'
  - name: CLEARML_AGENT_GIT_USER
    value: clearml-agent
  - name: CLEARML_AGENT_GIT_PASS
    valueFrom:
      secretKeyRef:
        name: clearml-repo-read
        key: token
  nodeSelector: {}
  defaultContainerImage: <some_image>
  queue: services
  apiServerUrlReference: ""
  fileServerUrlReference: ""
  webServerUrlReference: ""
  basePodTemplate:
    resources: 
      requests:
        nvidia.com/gpu: "4"
      limits:
        nvidia.com/gpu: "4"

So for every step the agent spins up a pod, and every pod has access to 4 replicas of the GPU, which is OK... but how do I prevent the controller from getting that access?

  
  
Posted 3 months ago

And as it turns out, I cannot specify multiple queues:

-- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3

queue: default
gives an error

  
  
Posted 3 months ago