So if you now run helm get values clearml-agent -n <NAMESPACE>, where <NAMESPACE> is the value you have in the $NS variable, can you confirm this is the full and only output? Of course, the $VARIABLES will show their real values
agentk8sglue:
  # Try newer image version to fix Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "1.25-1"
    pullPolicy: Always
  apiServerUrlReference: "http://$NODE_IP:30008"
  fileServerUrlReference: "ht...
In your last message, you are referring to pod security context and admission controllers enforcing some policies such as a read-only filesystem. Is that the case in your cluster?
Or was this some output of a GPT-like chat? If yes, please do not use LLMs to generate values for the helm installation, as they usually don't provide a useful or real config
It's a bit hard for me to provide support here with the additional layer of Argo.
I assume the server is working fine and you can open the ClearML UI and log in, right? If so, would it be possible to extract just the Agent part out of Argo and install it through standard Helm?
Please replace those credentials on the Agent and try upgrading the helm release
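Something along these lines should do it (the release and chart names here are assumptions, use whatever you originally installed with):

```shell
# Upgrade the existing release with the updated override values
helm upgrade clearml-agent clearml/clearml-agent -n <NAMESPACE> -f values-override.yaml
```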
Hey @<1649221394904387584:profile|RattySparrow90> - You can try configuring CLEARML__logging__root__level as an extraEnvs for the apiserver and fileserver 🙂
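A minimal sketch of what that could look like in the server chart override values (assuming extraEnvs is supported on both components, as in recent chart versions):

```yaml
apiserver:
  extraEnvs:
    # Env var name taken from the message above; raises the root logger verbosity
    - name: CLEARML__logging__root__level
      value: "DEBUG"
fileserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"
```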
@<1736194540286513152:profile|DeliciousSeaturtle82> when you copy the folder on the new pod, it crashes almost instantly?
Hi @<1798887585121046528:profile|WobblyFrog79> - Please try setting the environment variable CLEARML_K8S_GLUE_DEBUG=1 on the Agent
agentk8sglue:
  extraEnvs:
    - name: CLEARML_K8S_GLUE_DEBUG
      value: "1"
This will make the Agent Pod print the rendered Task Pod template in the logs, so you can see it 🙂
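To then read that output, something like this should work (the pod/deployment names are assumptions, adjust them to whatever your release created):

```shell
# Find the agent glue pod, then follow its logs to see the rendered template
kubectl get pods -n <NAMESPACE> | grep agent
kubectl logs -n <NAMESPACE> <AGENT_POD_NAME> -f
```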
@<1736194540286513152:profile|DeliciousSeaturtle82> the data folder for mongo4 and mongo5 might be slightly different. What is the target path where you're moving data in mongo5? And how is that mounted?
And when you say "broken", could you elaborate on that? Does the target Mongo Pod crash when trying to move the data? Or you succeed in copying the data but can't see the result in the UI?
Hey @<1523701304709353472:profile|OddShrimp85> - You can tweak the following section in the clearml-agent override values:
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"
  # -- Private image registry configuration
  imageCredentials:
    # -- Use private authentication mode
    enabled: true # <-- Set this to true
    # -- Registry name
    registry: docker.io
    # -- Registry username
    username: someone
    # -- Registry password
    password: pwd
    # -- ...
@<1669152726245707776:profile|ManiacalParrot65> could you please send your values file override for the Agent helm chart?
@<1710827340621156352:profile|HungryFrog27> have you installed the Nvidia gpu-operator to advertise GPUs to Kubernetes?
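For reference, a typical gpu-operator install looks roughly like this (repo and chart names taken from NVIDIA's docs, the namespace is an assumption):

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Installs the operator that advertises nvidia.com/gpu resources to the scheduler
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
# Verify GPUs are advertised on the nodes
kubectl describe nodes | grep nvidia.com/gpu
```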
Hi @<1523708147405950976:profile|AntsyElk37> - Yes, having the runtimeClass makes sense. I am handling your PR soon 🙂
Hi Amir, could you please share the values override that you used to install the clearml server helm chart?
The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL
Hey @<1734020156465614848:profile|ClearKitten90> - You can try with the following in your ClearML Agent override helm values. Make sure to replace mygitusername and git-password
agentk8sglue:
  basePodTemplate:
    env:
      # to setup access to private repo, setup secret with git credentials
      - name: CLEARML_AGENT_GIT_USER
        value: mygitusername
      - name: CLEARML_AGENT_GIT_PASS
        valueFrom:
          secretKeyRef:
            name: git-password
            ...
Oh no worries, I understand 😄
Sure, if you could share the whole values and configs you're using to run both the server and agent that would be useful.
Also what about other Pods from the ClearML server, are there any other crash or similar error referring to a read-only filesystem? Are the server and agent installed on the same K8s node?
I assume the key and secret values here are redacted values and not the actual ones, right?
Also, in order to simplify the installation, can you use a simpler version of your values for now, something like this should work:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference:
clearml:
  agentk8sglueKey: <NEW_KEY>...
Hi @<1811208768843681792:profile|BraveGrasshopper38> , following up on your last message, are you running in an OpenShift k8s cluster?
Hi @<1523708147405950976:profile|AntsyElk37> - There's a few points missing for the PR to be completed, let's follow-up on GitHub. See my comments here None
Oh, okay, not sure this will be the only issue but you'll need these credentials to be valid, since they are used by the ClearML Agent to connect to the ClearML Server 🙂
The easiest way to generate credentials is to open the ClearML UI in the browser, log in with an Admin user, click on the user icon in the top-right corner and open Settings. From there go to "Workspace", click "Create new credentials" and use the values provided
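Once you have them, they go under the clearml section of the Agent override values, roughly like this (placeholders, not real credentials):

```yaml
clearml:
  agentk8sglueKey: "<access_key from the UI>"
  agentk8sglueSecret: "<secret_key from the UI>"
```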
Hey @<1734020208089108480:profile|WickedHare16> , could you please share your override values file for the clearml helm chart?
For the Task Pods running your experiments through the agent, you can change the base image to something you like that has the Python version you need. You can use this section of the values:
agentk8sglue:
  # -- default container image for ClearML Task pod
  defaultContainerImage: ubuntu:18.04 # <-- Change me!!
Can you try with these values? The changes are: not using clearmlConfig, not overriding the image (using the default instead), and not defining resources
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWear...
For this to work you might also have to add "secure": true in the connection string object:
externalServices:
  elasticsearchConnectionString: "[...,"secure":true,...]"