Hey @<1734020208089108480:profile|WickedHare16> , could you please share your override values file for the clearml helm chart?
@<1736194540286513152:profile|DeliciousSeaturtle82> when you copy the folder on the new pod, it crashes almost instantly?
Hey @<1734020208089108480:profile|WickedHare16> - Not 100% sure this is the issue, but I noticed a wrong configuration in your values.
You configured both these:
elasticsearch:
  enabled: true
externalServices:
  # -- Existing ElasticSearch connection string if elasticsearch.enabled is false (example in values.yaml)
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200},{\"host\":\"es_hostname2\",\"port\":9200},{\"host\":\"es_hostname3\",\"port\":9200}]"
Pl...
Do you mean the Python version that is installed on the clearml agent itself? Or do you mean the Python version available in tasks that will be run from the agent?
Sure! I'll talk to the guys to update the documentation 🙂
Hey @<1736194540286513152:profile|DeliciousSeaturtle82> , yes please try changing the health check to /debug.conf
or /debug.ping
🙂
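If that health check is a Kubernetes probe, a minimal sketch of what it could look like (the probe type and container port are assumptions, adjust them to your deployment):

```yaml
# Hypothetical liveness probe hitting the debug endpoint.
# The port (8008) is an assumption; match it to your apiserver container.
livenessProbe:
  httpGet:
    path: /debug.ping
    port: 8008
```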
I think Mongo does not like its db folder being replaced like this in a running Pod.
You can try turning Mongo off for a moment (scale its deployment down to 0 replicas), then create a one-time Pod (non-Mongo; an ubuntu image works, for example) mounting the same volume Mongo was mounting, and use this Pod to copy the db folder into the right place. When it's done, delete this Pod and scale the Mongo deployment back up to 1.
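A sketch of such a one-time helper Pod (the PVC name and mount path are assumptions, check your actual Mongo release for the real ones):

```yaml
# One-time helper Pod mounting the same PVC as the Mongo deployment.
# "mongodb-data" and "/data/db" are assumptions; verify them with
# `kubectl get pvc` and the Mongo deployment spec before applying.
apiVersion: v1
kind: Pod
metadata:
  name: mongo-copy-helper
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: ubuntu:22.04
      command: ["sleep", "infinity"]  # keep the Pod alive for kubectl exec
      volumeMounts:
        - name: data
          mountPath: /data/db
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: mongodb-data
```

You can then `kubectl exec` into it, copy the db folder into place, delete the Pod, and scale Mongo back up.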
@<1736194540286513152:profile|DeliciousSeaturtle82> the data folder for mongo4 and mongo5 might be slightly different. What is the target path where you're moving data in mongo5? And how is that mounted?
And when you say "broken", could you elaborate on that? Does the target Mongo Pod crash when trying to move the data? Or you succeed in copying the data but can't see the result in the UI?
For the Task Pods running your experiments through the agent, you can change the base image to whatever you like so it has the Python version you need. You can use this section of the values:
agentk8sglue:
  # -- default container image for ClearML Task pod
  defaultContainerImage: ubuntu:18.04 # <-- Change me!!
@<1734020208089108480:profile|WickedHare16> - please try configuring the cookieDomain
clearml:
  cookieDomain: ""
You should set it to your base domain (for example pixis.internal), without any api or files prefix in front of it
Hey @<1649221394904387584:profile|RattySparrow90> - You can try configuring CLEARML__logging__root__level
as an extraEnvs for the apiserver and fileserver 🙂
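A minimal sketch of what that could look like in the override values (DEBUG is just an example value; the extraEnvs list format is assumed to match the chart's standard env-var entries):

```yaml
apiserver:
  extraEnvs:
    # Root log level for the apiserver (example value)
    - name: CLEARML__logging__root__level
      value: "DEBUG"
fileserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"
```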
Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only for the enterprise version.
What you can do, though, is deploy two different agents in k8s using the helm chart. Simply install two different releases, then modify only one of them so that its basePodTemplate requests nvidia.com/gpu: "4"
Let me know if this solves your issue 🙂
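A sketch of what that GPU request could look like in that release's override values (the resources placement follows the standard Kubernetes resource spec; adjust to your chart version):

```yaml
agentk8sglue:
  basePodTemplate:
    resources:
      limits:
        # request 4 GPUs for every Task Pod spawned by this agent
        nvidia.com/gpu: "4"
```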
Hey @<1734020156465614848:profile|ClearKitten90> - You can try with the following in your ClearML Agent override helm values. Make sure to replace mygitusername and git-password
agentk8sglue:
  basePodTemplate:
    env:
      # to setup access to private repo, setup secret with git credentials
      - name: CLEARML_AGENT_GIT_USER
        value: mygitusername
      - name: CLEARML_AGENT_GIT_PASS
        valueFrom:
          secretKeyRef:
            name: git-password
            ...
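The referenced secret can be created beforehand; a sketch (the secret name matches the snippet above, but the key name is an assumption and must match the secretKeyRef key you use):

```yaml
# Kubernetes Secret holding the git password/token.
# The key name ("password") is an assumption; it must match the
# secretKeyRef key in the agent values.
apiVersion: v1
kind: Secret
metadata:
  name: git-password
type: Opaque
stringData:
  password: "<your-git-token>"
```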
@<1669152726245707776:profile|ManiacalParrot65> could you please send your values file override for the Agent helm chart?
The value can be one of DEBUG, INFO, WARNING, ERROR, CRITICAL
Hey @<1523701304709353472:profile|OddShrimp85> - You can tweak the following section in the clearml-agent override values:
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"
  # -- Private image registry configuration
  imageCredentials:
    # -- Use private authentication mode
    enabled: true # <-- Set this to true
    # -- Registry name
    registry: docker.io
    # -- Registry username
    username: someone
    # -- Registry password
    password: pwd
    # -- ...
Hey @<1726047624538099712:profile|WorriedSwan6> , the basePodTemplate section configures the default base template for all Pods spawned by the Agent.
If you don't want every Task (or Pod) to use the same requests/limits, one thing you could try is to set up multiple queues in the Agent.
Each queue can then have an override of the Pod template.
So, you can try removing the nvidia.com/gpu: "4" from the root basePodTemplate and adding a section like this in ...