Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is to deploy two different agents in k8s using the Helm chart. Simply try installing two different releases, then modify only one of them to have basePodTemplate use nvidia.com/gpu: "4".
Let me know if this solves your issue 🙂
@<1669152726245707776:profile|ManiacalParrot65> could you please send your values file override for the Agent helm chart?
Sure! I'll talk to the guys to update the documentation 🙂
Oh, I see, because you are using a self-signed certificate, correct?
Can you try with these values? The changes are: not using clearmlConfig, not overriding the image (using the default instead), and not defining resources:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWear...
@<1726047624538099712:profile|WorriedSwan6> - When deploying the ClearML Agent, could you try passing the external fileserver url to the configuration you previously mentioned? Like this:
agentk8sglue:
  fileServerUrlReference: "<your-external-fileserver-url>"
The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
Do you mean the Python version that is installed on the clearml agent itself? Or do you mean the Python version available in tasks that will be run from the agent?
So, when the UI displays a debug image, it uses the URL for that image, which was created at runtime by the running SDK (launched by the Agent, in this case), and therefore points at the fileserver URL provided to the agent.
You will need to pass the external reference:
agentk8sglue:
  fileServerUrlReference: "<your-external-fileserver-url>"
and work around the self-signed cert. You could try mounting your custom certificates into the Agent using volumes and volumeMounts, storing your certificate in a ConfigMap or similar.
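A hedged sketch of what that mounting could look like (the volumes/volumeMounts keys under agentk8sglue and the ConfigMap name my-ca-cert are assumptions; check your chart version's values.yaml for the exact supported keys):
agentk8sglue:
  # assumption: the chart exposes extra volumes/volumeMounts for the Agent pod
  volumes:
    - name: ca-cert
      configMap:
        name: my-ca-cert  # hypothetical ConfigMap holding your self-signed certificate
  volumeMounts:
    - name: ca-cert
      mountPath: /usr/local/share/ca-certificates/my-ca.crt
      subPath: my-ca.crt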
Hey @<1726047624538099712:profile|WorriedSwan6> , the basePodTemplate section configures the default base template for all pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and add a section like this in ...
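The truncated part above likely showed a per-queue override; here is a hedged sketch based on the multi-queue templateOverrides feature (the queue name gpu-queue is hypothetical, and note this feature may be enterprise-only, as clarified elsewhere in this thread):
agentk8sglue:
  queues:
    gpu-queue:  # hypothetical queue name
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: "4"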
Hi @<1523701907598610432:profile|ReassuredArcticwolf33> - Are you referring to the clearml helm chart or to the clearml-agent one?
In either case, the respective values.yaml file is self-documented and contains examples. Here is an example of adding additional volumes and volume mounts to the apiserver component of the clearml chart:
apiserver:
  # -- Defines extra Kubernetes volumes to be attached to the pod.
  additionalVolumes:
    - name: ramdisk
      empty...
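For completeness, a hedged sketch of how the full pair might look (additionalVolumeMounts and the mount path are assumptions based on the chart's self-documented values; the truncated volume above was presumably an emptyDir):
apiserver:
  additionalVolumes:
    - name: ramdisk
      emptyDir:
        medium: Memory  # assumption: a RAM-backed emptyDir, given the name "ramdisk"
  additionalVolumeMounts:
    - name: ramdisk
      mountPath: /mnt/ramdisk  # hypothetical mount path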
Hey @<1523701304709353472:profile|OddShrimp85> - You can tweak the following section in the clearml-agent override values:
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"

# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: true # <-- Set this to true
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd
  # -- ...
@<1726047624538099712:profile|WorriedSwan6> could you please run a kubectl describe pod of the clearml webserver Pod and dump the output here?
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
- Regarding bypassing the IAP, I am not sure. Could you elaborate a bit? Do you have a specific solution in mind?
- For exposing the interactive sessions you can use a LoadBalancer config as mentioned (if your cloud provider supports its configuration) or use a NodePort service type (making sure there are no firewall rules blocking the defined ports on the Nodes and that you can access them). Exposing the sessions through an Ingress is supported in t...
Hey @<1743079861380976640:profile|HighKitten20> - Try to configure this section in the values override file for the Agent helm chart:
# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd...
Hey @<1736194540286513152:profile|DeliciousSeaturtle82> , yes please try changing the health check to /debug.conf or /debug.ping 🙂
For Task Pods running your experiments through the agent, you can change the base image to anything you like that has the Python version you need. You can use this section of the values:
agentk8sglue:
  # -- default container image for ClearML Task pod
  defaultContainerImage: ubuntu:18.04 # <-- Change me!!
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
When you say "one of the api backend is UNHEALTHY", do you mean you have multiple replicas of the apiserver component (i.e. you set the value apiserver.replicaCount > 1) and one of them is not ready?
Could you please share the output of the kubectl describe command for the ClearML apiserver Deployment?
Hey @<1734020156465614848:profile|ClearKitten90> - You can try the following in your ClearML Agent override helm values. Make sure to replace mygitusername and the git-password secret reference with your own:
agentk8sglue:
  basePodTemplate:
    env:
      # to setup access to private repo, setup secret with git credentials
      - name: CLEARML_AGENT_GIT_USER
        value: mygitusername
      - name: CLEARML_AGENT_GIT_PASS
        valueFrom:
          secretKeyRef:
            name: git-password
            ...
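The referenced secret must exist in the same namespace; a minimal sketch of one way to define it (the key name git-password inside the secret is an assumption, since the secretKeyRef key was truncated above; make it match whatever you set there):
apiVersion: v1
kind: Secret
metadata:
  name: git-password
type: Opaque
stringData:
  git-password: "<your-git-token>"  # hypothetical key name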
@<1752864322440138752:profile|GiddyDragonfly90> - MongoDB is used as a dependency Helm chart from the Bitnami repo. We are using version 12.1.31 of that chart.
In the clearml override values, under the mongodb section, you can specify any value that is usable in the original chart 🙂
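For instance, a hedged sketch of passing a Bitnami value through (auth.enabled is a real value of the Bitnami MongoDB chart; whether you want to change it depends on your setup):
mongodb:
  # values under this key are forwarded to the Bitnami MongoDB subchart
  auth:
    enabled: false  # hypothetical example override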
@<1736194540286513152:profile|DeliciousSeaturtle82> when you copy the folder onto the new pod, does it crash almost instantly?
@<1710827340621156352:profile|HungryFrog27> have you installed the Nvidia gpu-operator to advertise GPUs to Kubernetes?
Hey @<1734020208089108480:profile|WickedHare16> - Not 100% sure this is the issue, but I noticed a wrong configuration in your values.
You configured both these:
elasticsearch:
  enabled: true
externalServices:
  # -- Existing ElasticSearch connectionstring if elasticsearch.enabled is false (example in values.yaml)
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200},{\"host\":\"es_hostname2\",\"port\":9200},{\"host\":\"es_hostname3\",\"port\":9200}]"
Pl...
@<1734020208089108480:profile|WickedHare16> - please try configuring the cookieDomain:
clearml:
  cookieDomain: ""
You should set it to your base domain, for example pixis.internal, without any api or files prefix in front of it.
@<1752864322440138752:profile|GiddyDragonfly90> - I think you can also add verify_certs: false in the same elasticsearchConnectionString object, have you tried?
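A hedged sketch of what that might look like (assuming the connection string accepts extra client options alongside host/port, as suggested above):
externalServices:
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200,\"verify_certs\":false}]"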
@<1736194540286513152:profile|DeliciousSeaturtle82> the data folder for mongo4 and mongo5 might be slightly different. What is the target path where you're moving data in mongo5? And how is that mounted?
And when you say "broken", could you elaborate on that? Does the target Mongo Pod crash when trying to move the data? Or do you succeed in copying the data but can't see the result in the UI?
I understand; I'd just like to make sure that's the root issue and there's no other bug. If so, then you can think about how to automate it via the API.
Hello @<1523708147405950976:profile|AntsyElk37> 🙂
You are right, the spec.runtimeClassName field is not supported in the Agent at the moment; I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtimeClass to use GPUs?
Usually, you'd need to specify a Pod's container with, for example, resources.limits.nvidia.com/gpu: 1, and the Nvidia Device Plugin would itself assign the correct device to the container. Will that work?
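For reference, a minimal sketch of such a container spec (plain Kubernetes, nothing ClearML-specific; the Pod name and image are hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: main
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1  # the Nvidia Device Plugin assigns a matching GPU device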
If that doesn't work, try removing the auth from the connection string and instead define two extraEnvs for the apiserver:
apiserver:
  extraEnvs:
    - name: CLEARML_ELASTIC_SERVICE_USERNAME
      value: "elastic"
    - name: CLEARML_ELASTIC_SERVICE_PASSWORD
      value: "toto"