Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
- Regarding bypassing the IAP I am not sure. Could you elaborate a bit? Do you have some expected solution in mind?
- For exposing the interactive sessions you can use a LoadBalancer config as mentioned (if your cloud provider supports it) or a NodePort service type (making sure there are no firewall rules blocking access to the defined ports on the Nodes). Exposing the sessions through an Ingress is supported in t...
@<1734020208089108480:profile|WickedHare16> - please try configuring the cookieDomain
clearml:
  cookieDomain: ""
You should set it to your base domain, for example pixis.internal, without any api or files prefix in front of it
Hi @<1752864322440138752:profile|GiddyDragonfly90> - Can you try with the last value you proposed, but use a colon : to separate the user and password in the string, like this:
externalServices:
  elasticsearchConnectionString: '[{"scheme":"http","host":"elastic:toto@elasticsearch-es-http","port":9200}]'
@<1726047624538099712:profile|WorriedSwan6> could you please run a kubectl describe pod of the clearml webserver Pod and dump the output here?
I understand, I'd just like to make sure if that's the root issue and there's no other bug, and if so then you can think of how to automate it via API
Hey @<1743079861380976640:profile|HighKitten20> - Try to configure this section in the values override file for the Agent helm chart:
# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode (set to true to enable these credentials)
  enabled: true
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd...
Hi @<1523708147405950976:profile|AntsyElk37> - There's a few points missing for the PR to be completed, let's follow-up on GitHub. See my comments here None
@<1752864322440138752:profile|GiddyDragonfly90> - MongoDB is used as a dependency Helm Chart from the Bitnami repo. We are using version 12.1.31 of the chart. See this tag None
In the clearml override values, under the mongodb section you can specify any value that is usable in the original chart 🙂
I think Mongo does not like having its db folder replaced like this while the Pod is running.
You can try the following:
- Turn off Mongo for a moment (scale the deployment down to 0 replicas).
- Create a one-time Pod (non-mongo, you can use an ubuntu image for example) mounting the same volume that Mongo was mounting.
- Use this Pod to copy the db folder into the right place.
- When it's done, delete this Pod and scale the Mongo deployment back to 1.
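As a rough sketch of the one-time Pod (the PVC name mongodb-data is an assumption on my side, check the real one with kubectl get pvc):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mongo-data-copy  # temporary helper Pod, delete it when done
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: ubuntu:22.04
      command: ["sleep", "infinity"]  # keep it alive so you can kubectl exec into it
      volumeMounts:
        - name: data
          mountPath: /data  # the Mongo volume will be visible here
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mongodb-data  # assumed PVC name, verify before applying
```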
Do you mean the Python version that is installed on the clearml agent itself? Or do you mean the Python version available in tasks that will be run from the agent?
@<1752864322440138752:profile|GiddyDragonfly90> - I think you can also add verify_certs: false in the same elasticsearchConnectionString object, have you tried?
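For illustration, the combined string could look like this (untested sketch; the verify_certs placement inside the host object is just the suggestion above, please verify it works):

```yaml
externalServices:
  elasticsearchConnectionString: '[{"scheme":"http","host":"elastic:toto@elasticsearch-es-http","port":9200,"verify_certs":false}]'
```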
Sure! I'll talk to the guys to update the documentation 🙂
Hey @<1726047624538099712:profile|WorriedSwan6> , the basePodTemplate section configures the default base template for all pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and adding a section like this in ...
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
When you say
one of the api backend is UNHEALTHY
do you mean you have multiple replicas of the apiserver component (i.e. you set the values apiserver.replicaCount > 1) and one of them is not ready?
Could you please share the output of the kubectl describe command for the ClearML apiserver Deployment?
I see, in the example you provided you used a comma , to separate the username and password; I suggest trying a colon : instead
🙂 let me know if that works for you
Hey @<1734020208089108480:profile|WickedHare16> - Not 100% sure this is the issue, but I noticed a wrong configuration in your values.
You configured both of these:
elasticsearch:
  enabled: true
externalServices:
  # -- Existing ElasticSearch connection string if elasticsearch.enabled is false (example in values.yaml)
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200},{\"host\":\"es_hostname2\",\"port\":9200},{\"host\":\"es_hostname3\",\"port\":9200}]"
Pl...
If that doesn't work, try removing the auth from the connection string and instead define two extraEnvs for the apiserver:
apiserver:
  extraEnvs:
    - name: CLEARML_ELASTIC_SERVICE_USERNAME
      value: "elastic"
    - name: CLEARML_ELASTIC_SERVICE_PASSWORD
      value: "toto"
So CLEARML8AGENT9KEY1234567890ABCD is the actual real value you are using?
Wonderful - We do not have such a feature planned for now, feel free to contribute 🙂
Hi @<1523701907598610432:profile|ReassuredArcticwolf33> - Are you referring to the clearml helm chart or to the clearml-agent one?
In either case, the respective values.yaml file is self-documented and contains examples. Here is an example of adding additional volumes and volume mounts to the apiserver component of the clearml chart:
apiserver:
  # -- Defines extra Kubernetes volumes to be attached to the pod.
  additionalVolumes:
    - name: ramdisk
      empty...
Oh, I see, because you are using a self-signed certificate, correct?
So, when the UI fetches a debug image, it gets the URL for that image, which was created at runtime by the running SDK (by the Agent, in this case), using the fileserver URL provided by the agent.
You will need to pass the external reference:
agentk8sglue:
  fileServerUrlReference: ""
and work around the self-signed cert. You could try mounting your custom certificates to the Agent using volumes and volumeMounts, storing your certificate in a ConfigMap or similar
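One possible shape for the cert mount (a sketch only: the ConfigMap name my-ca-cert and the exact value paths are my assumptions, please double-check against the agent chart's values.yaml):

```yaml
agentk8sglue:
  basePodTemplate:
    volumes:
      - name: custom-ca
        configMap:
          name: my-ca-cert  # hypothetical ConfigMap holding your certificate
    volumeMounts:
      - name: custom-ca
        mountPath: /usr/local/share/ca-certificates/my-ca.crt
        subPath: my-ca.crt
```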
Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is deploy two different agents in k8s using the helm chart. Simply install two different releases, then modify only one of them so its basePodTemplate uses the nvidia.com/gpu: "4" limit
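As a sketch of what the GPU-enabled release's values override could look like (the queue name gpu and the file name gpu-values.yaml are my assumptions, please check the chart's values.yaml for the exact structure):

```yaml
# gpu-values.yaml -- hypothetical override for the GPU-enabled agent release
agentk8sglue:
  queue: gpu  # assumed queue name; use whatever queue you created in the UI
  basePodTemplate:
    resources:
      limits:
        nvidia.com/gpu: "4"
```

You could then install it alongside the default one with something like helm install clearml-agent-gpu clearml/clearml-agent -f gpu-values.yaml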
Let me know if this solves your issue 🙂
Hello @<1523708147405950976:profile|AntsyElk37> 🙂
You are right, the spec.runtimeClassName field is not supported by the Agent at the moment; I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtimeClass to use GPUs?
Usually you'd only need to set, for example, resources.limits with nvidia.com/gpu: 1 on a Pod's container, and the Nvidia Device Plugin would itself assign the correct device to the container. Will that work?
Hi @<1843461294267568128:profile|KindArcticwolf58> - How did you execute this task?
The k8s_scheduler queue is an internal queue, not intended for enqueuing tasks. I see you have configured the Agent to watch the gpu queue. Please make sure to create a queue with the same name on the control plane from the UI, restart the Agent, then enqueue the Task on this queue.
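For reference, a minimal sketch of pointing the Agent at the gpu queue in its values override (assuming the standard agentk8sglue.queue value from the agent chart):

```yaml
agentk8sglue:
  queue: gpu  # must match a queue created in the UI
```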
Hey @<1736194540286513152:profile|DeliciousSeaturtle82> , yes please try changing the health check to /debug.conf or /debug.ping 🙂