Hi @<1726047624538099712:profile|WorriedSwan6>, ideally the pipeline controller should run on the services agent, which is part of the server deployment and does not require any GPU resources.
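For example, here is a minimal sketch of a pipeline whose controller runs on the services queue while the actual step is enqueued to a GPU-backed queue (the project, task and queue names are just placeholders):

from clearml import PipelineController

# Placeholder names, for illustration only
pipe = PipelineController(
    name="example-pipeline",
    project="examples",
    version="1.0.0",
)

# The step is enqueued to a GPU-backed queue served by the k8s agent
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="training task",
    execution_queue="gpu-queue",
)

# The controller itself runs on the lightweight "services" queue,
# so it never consumes GPU resources
pipe.start(queue="services")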
Hey @<1523701070390366208:profile|CostlyOstrich36>
Can you explain this point a bit more?
In the helm chart of the agent I configure it like so:
...
agentk8sglue:
  extraEnvs:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-access-key-id
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-secret-access-key
          key: AWS_SECRET_ACCESS_KEY
    - name: K8S_GLUE_MAX_PODS
      value: '1'
    - name: CLEARML_AGENT_GIT_USER
      value: clearml-agent
    - name: CLEARML_AGENT_GIT_PASS
      valueFrom:
        secretKeyRef:
          name: clearml-repo-read
          key: token
  nodeSelector: {}
  defaultContainerImage: <some_image>
  queue: services
  apiServerUrlReference: "
"
  fileServerUrlReference: "
"
  webServerUrlReference: "
"
  basePodTemplate:
    resources:
      requests:
        nvidia.com/gpu: "4"
      limits:
        nvidia.com/gpu: "4"
So on every step the agent spins up a pod, and every pod has access to 4 replicas of the GPU, which is fine... but how do I prevent the controller from getting that access?
Hey @<1726047624538099712:profile|WorriedSwan6>, the basePodTemplate section configures the default base template for all pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and adding a section like this in the values instead:
agentk8sglue:
  createQueues: true
  queues:
    myQueueWith4GPUs:
      templateOverrides:
        resources:
          requests:
            nvidia.com/gpu: "4"
          limits:
            nvidia.com/gpu: "4"
When you want a Task to use the 4 GPU slices, you simply need to enqueue it on this myQueueWith4GPUs queue; otherwise it won't get the nvidia.com/gpu: "4" resources.
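For example (a minimal sketch; the project and task names are placeholders):

from clearml import Task

# Placeholder project/task names, for illustration only
task = Task.get_task(project_name="examples", task_name="train_model")

# Enqueue it on the GPU-backed queue so the spawned pod gets the
# nvidia.com/gpu: "4" requests/limits from the templateOverrides above
Task.enqueue(task, queue_name="myQueueWith4GPUs")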
@<1729671499981262848:profile|CooperativeKitten94> thank you! I will try it and update : ))
@<1729671499981262848:profile|CooperativeKitten94> Running the following conf:
queue:
  services-tasks:
    templateOverrides:
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
  services:
    templateOverrides:
      resources:
        requests:
          nvidia.com/gpu: "0"
        limits:
          nvidia.com/gpu: "0"
apiServerUrlReference: "
"
fileServerUrlReference: "
"
webServerUrlReference: "
"
basePodTemplate:
  resources:
    requests:
      nvidia.com/gpu: "2"
    limits:
      nvidia.com/gpu: "2"
causes the agent pod to go into a CrashLoopBackOff:
python3 k8s_glue_example.py --queue 'map[services:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:0]' 'requests:map[nvidia.com/gpu:0]]]]' 'services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1]' 'requests:map[nvidia.com/gpu:1]]]]]' --max-pods 2 --namespace clearml --template-yaml /root/template/template.yaml
/usr/local/lib/python3.6/dist-packages/jwt/utils.py:7: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography and will be removed in a future release.
from cryptography.hazmat.primitives.asymmetric.ec import EllipticCurve
usage: k8s_glue_example.py [-h] [--queue QUEUE] [--ports-mode]
[--num-of-services NUM_OF_SERVICES]
[--base-port BASE_PORT]
[--base-pod-num BASE_POD_NUM]
[--gateway-address GATEWAY_ADDRESS]
[--pod-clearml-conf POD_CLEARML_CONF]
[--overrides-yaml OVERRIDES_YAML]
[--template-yaml TEMPLATE_YAML]
[--ssh-server-port SSH_SERVER_PORT]
[--namespace NAMESPACE] [--max-pods MAX_PODS]
[--use-owner-token] [--standalone-mode]
[--child-report-tags CHILD_REPORT_TAGS [CHILD_REPORT_TAGS ...]]
k8s_glue_example.py: error: unrecognized arguments: requests:map[nvidia.com/gpu:0]]]] services-tasks:map[templateOverrides:map[resources:map[limits:map[nvidia.com/gpu:1] requests:map[nvidia.com/gpu:1]]]]]
I must add that I do not see anything in the helm chart for using templateOverrides.
And as it turns out, I cannot specify multiple queues:
-- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3
queue: default
gives an error
Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is deploy two different agents in k8s using the helm chart. Simply try installing two different releases, then modify only one of them to have basePodTemplate use the nvidia.com/gpu: "4" requests/limits.
Let me know if this solves your issue 🙂
Hey @<1729671499981262848:profile|CooperativeKitten94> yes, it did! 🙂
I thank you for the support.