I think Mongo does not like having its db folder replaced like this while the Pod is running.
You can try turning Mongo off for a moment (scale its Deployment down to 0 replicas), then create a one-time Pod (non-mongo; an ubuntu image works fine, for example) mounting the same volume that Mongo was mounting, and use this Pod to copy the db folder into the right place. When it's done, delete the Pod and scale the Mongo Deployment back up to 1. A rough sketch of such a one-time Pod is below.
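Something like this should work as the one-time Pod (a sketch only; the Pod name and the claimName are placeholders you'd replace with your actual Mongo PVC):

apiVersion: v1
kind: Pod
metadata:
  name: mongo-data-copy
spec:
  restartPolicy: Never
  containers:
    - name: shell
      image: ubuntu:22.04
      # keep the Pod alive so we can exec into it
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: mongo-data
          mountPath: /data
  volumes:
    - name: mongo-data
      persistentVolumeClaim:
        claimName: <your-mongo-pvc>

Then kubectl exec into it, move the db folder into place under /data, delete the Pod, and scale Mongo back up.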
Hello @<1523708147405950976:profile|AntsyElk37> 🙂
You are right, the spec.runtimeClassName field is not supported in the Agent at the moment; I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtime class to use GPUs?
Usually, you'd set a Pod container's resources.limits with, for example, nvidia.com/gpu: 1, and the Nvidia Device Plugin would itself assign the correct device to the container. Will that work?
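For illustration, a minimal Pod spec fragment along those lines (the Pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-task
spec:
  containers:
    - name: task
      image: <your-task-image>
      resources:
        limits:
          # the device plugin picks a free GPU and wires it into the container
          nvidia.com/gpu: 1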
Hey @<1734020156465614848:profile|ClearKitten90> - You can try with the following in your ClearML Agent override helm values. Make sure to replace mygitusername and git-password with your own values:
agentk8sglue:
  basePodTemplate:
    env:
      # to setup access to private repo, setup secret with git credentials
      - name: CLEARML_AGENT_GIT_USER
        value: mygitusername
      - name: CLEARML_AGENT_GIT_PASS
        valueFrom:
          secretKeyRef:
            name: git-password
            ...
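To create that secret, something along these lines should work (the key name here is an assumption; it must match the key field of the secretKeyRef, which is truncated above):

kubectl create secret generic git-password \
  --namespace <clearml-agent-namespace> \
  --from-literal=password='<your-git-password-or-token>'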
Sure! I'll talk to the guys to update the documentation 🙂
Hey @<1743079861380976640:profile|HighKitten20> - Try to configure this section in the values override file for the Agent helm chart:
# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd...
In your last message, you are referring to Pod security contexts and admission controllers enforcing policies such as a read-only filesystem. Is that the case in your cluster?
Or was this the output of a GPT-like chat? If so, please do not use LLMs to generate values for the helm installation, as they usually don't produce a useful or real config.
So if you now run helm get values clearml-agent -n <NAMESPACE>, where <NAMESPACE> is the value you have in the $NS variable, can you confirm this is the full and only output? Of course, the $VARIABLES will have their real values:
agentk8sglue:
  # Try newer image version to fix Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "1.25-1"
    pullPolicy: Always
  apiServerUrlReference: "http://$NODE_IP:30008"
  fileServerUrlReference: "ht...
Wonderful - We do not have such a feature planned for now; feel free to contribute 🙂
Do you mean the Python version that is installed on the clearml agent itself? Or do you mean the Python version available in tasks that will be run from the agent?
Hey @<1734020208089108480:profile|WickedHare16> , could you please share your override values file for the clearml helm chart?
Hey @<1736194540286513152:profile|DeliciousSeaturtle82> , yes, please try changing the health check to /debug.conf or /debug.ping 🙂
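To verify the endpoint responds, something like this should work (assuming the default apiserver port 8008; adjust the host and port to your setup):

curl http://<apiserver-host>:8008/debug.ping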
@<1736194540286513152:profile|DeliciousSeaturtle82> the data folder for mongo4 and mongo5 might be slightly different. What is the target path where you're moving data in mongo5? And how is that mounted?
And when you say "broken", could you elaborate on that? Does the target Mongo Pod crash when trying to move the data? Or do you succeed in copying the data but can't see the result in the UI?
It's a bit hard for me to provide support here with the additional layer of Argo.
I assume the server is working fine and you can open the clearml UI and log in, right? If yes, would it be possible to extract the Agent part only, out of Argo, and proceed with installing it through standard helm?
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
When you say "one of the api backend is UNHEALTHY", do you mean you have multiple replicas of the apiserver component (i.e. you set the value apiserver.replicaCount > 1) and one of them is not ready?
Could you please share the output of the kubectl describe command for the ClearML apiserver Deployment?
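For example (assuming the default Deployment name from the chart; adjust it to match your release):

kubectl describe deployment clearml-apiserver -n <NAMESPACE>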
Hi Amir, could you please share the values override that you used to install the clearml server helm chart?
Hey @<1726047624538099712:profile|WorriedSwan6> - I am sorry, I forgot that the multi-queue feature with templateOverrides is only available in the enterprise version.
What you can do, though, is deploy two different agents in k8s using the helm chart. Simply install two different releases, then modify only one of them so its basePodTemplate requests nvidia.com/gpu: "4" as a resource limit, as sketched below.
Let me know if this solves your issue 🙂
@<1669152726245707776:profile|ManiacalParrot65> could you please send your values file override for the Agent helm chart?
Hey @<1649221394904387584:profile|RattySparrow90> - You can try configuring CLEARML__logging__root__level as an extraEnvs entry for the apiserver and fileserver 🙂
The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
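For example, in the server chart's override values (a sketch; verify the extraEnvs key against your chart version):

apiserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"
fileserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"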
Hey @<1734020208089108480:profile|WickedHare16> - Not 100% sure this is the issue, but I noticed a wrong configuration in your values.
You configured both these:
elasticsearch:
  enabled: true
externalServices:
  # -- Existing ElasticSearch connectionstring if elasticsearch.enabled is false (example in values.yaml)
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200},{\"host\":\"es_hostname2\",\"port\":9200},{\"host\":\"es_hostname3\",\"port\":9200}]"
Please use only one of the two: either the bundled elasticsearch or the external connection string.
@<1734020208089108480:profile|WickedHare16> - please try configuring the cookieDomain:
clearml:
  cookieDomain: ""
You should set it to your base domain, for example pixis.internal, without any api or files prefix in front of it.
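So, assuming pixis.internal is your base domain, the override would look like this:

clearml:
  cookieDomain: "pixis.internal"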
Oh no worries, I understand 😄
Sure, if you could share the whole values and configs you're using to run both the server and agent that would be useful.
Also, what about other Pods from the ClearML server, are there any other crashes or similar errors referring to a read-only filesystem? Are the server and agent installed on the same K8s node?
Oh, I see, because you are using a self-signed certificate, correct?
Also, in order to simplify the installation, can you use a simpler version of your values for now? Something like this should work:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference:
clearml:
  agentk8sglueKey: <NEW_KEY>...
@<1726047624538099712:profile|WorriedSwan6> could you please run a kubectl describe pod of the clearml webserver Pod and dump the output here?
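For example (the Pod name will differ per release; list the Pods first to find the webserver one):

kubectl get pods -n <NAMESPACE> | grep webserver
kubectl describe pod <webserver-pod-name> -n <NAMESPACE>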
So CLEARML8AGENT9KEY1234567890ABCD is the actual real value you are using?
Can you try with these values? Specifically, the changes are: not using clearmlConfig, not overriding the image (using the default), and not defining resources:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWear...
Hi @<1523708147405950976:profile|AntsyElk37> - There are a few points missing for the PR to be completed; let's follow up on GitHub. See my comments here: None