The value field is a default that Argo falls back to if I don't provide one.
jcarvalho@kharrinhao:~$ kubectl get pods -n clearml-prod -l app.kubernetes.io/name=clearml-agent
NAME READY STATUS RESTARTS AGE
clearml-agent-547584497c-xf98z 0/1 Error 4 (60s ago) 2m8s
jcarvalho@kharrinhao:~$ kubectl logs -n clearml-prod -l app.kubernetes.io/name=clearml-agent
Defaulted container "k8s-glue" out of: k8s-glue, init-k8s-glue (init)
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 734, in _parse
flags = _parse_flags(source, state, char)
File "/usr/lib/python3.6/sre_parse.py", line 803, in _parse_flags
raise source.error("bad inline flags: cannot turn on global flag", 1)
sre_constants.error: bad inline flags: cannot turn on global flag at position 92 (line 4, column 20)
jcarvalho@kharrinhao:~$
I understand. I'd just like to make sure that this is the root issue and that there's no other bug; if so, you can then think about how to automate it via the API.
I had no issues deploying via GitHub, but Helm is quite a bit more confusing.
So CLEARML8AGENT9KEY1234567890ABCD is the actual value you are using?
I will try:
1- update the agent with these values
2- run Argo with those changes
cat values-prod.yaml
agent:
api:
web_server: "
"
api_server: "
"
files_server: "
"
credentials:
access_key: "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
secret_key: "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
Also, to simplify the installation, can you use a simpler version of your values for now? Something like this should work:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference:
clearml:
  agentk8sglueKey: <NEW_KEY>
  agentk8sglueSecret: <NEW_SECRET>
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
In your last message, you are referring to pod security context and admission controllers enforcing some policies such as a read-only filesystem. Is that the case in your cluster?
Or was this the output of a GPT-like chat? If so, please do not use LLMs to generate values for the Helm installation, as they usually don't produce a useful or valid config.
I assume the key and secret values here are redacted values and not the actual ones, right?
for now:
  - name: clearml-access-key
    value: CLEARML8AGENT9KEY1234567890ABCD
  - name: clearml-secret-key
    value: CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456
  - name: admin-password
    value: clearml123!
Just to check, is this the intended image? docker.io/allegroai/clearml-agent-k8s-base:1.24-2
Hi @<1857232027015712768:profile|PompousCrow47> , are you using pods with a read-only-filesystem limitation?
Please replace those credentials on the Agent and try upgrading the helm release
Hi! I'm using just a plain Kubernetes cluster (kubeadm) running on a Proxmox VM, and I'm using Argo to deploy the Helm chart in order to standardize it. Let me know if you need any more details!
kubectl describe pod -n clearml-prod -l app.kubernetes.io/name=clearml-agent
kubectl logs -n clearml-prod -l app.kubernetes.io/name=clearml-agent --previous 2>/dev/null || true
Name: clearml-agent-848875fbdc-x8x6s
Namespace: clearml-prod
Priority: 0
Service Account: clearml-agent-sa
Node: kharrinhao/192.168.70.211
Start Time: Mon, 21 Jul 2025 15:23:02 +0000
Labels: app.kubernetes.io/instance=clearml-agent
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=clearml-agent
app.kubernetes.io/version=1.24
helm.sh/chart=clearml-agent-5.3.3
pod-template-hash=848875fbdc
Annotations: checksum/config: 5c1b50a353fea7ffd1fa5e62f968edc92e2610e0f0fd7783900a44f899ebe9ca
cni.projectcalico.org/containerID: 6964e25aa0cf54fa1dc91e36648d97e6deeae3366a924579be1e72742a25365a
cni.projectcalico.org/podIP: 192.168.31.162/32
cni.projectcalico.org/podIPs: 192.168.31.162/32
Status: Running
IP: 192.168.31.162
IPs:
IP: 192.168.31.162
Controlled By: ReplicaSet/clearml-agent-848875fbdc
Init Containers:
init-k8s-glue:
Container ID:
Image: docker.io/allegroai/clearml-agent-k8s-base:1.24-21
Image ID: docker.io/allegroai/clearml-agent-k8s-base@sha256:772827a01bb5a4fff5941980634c8afa55d1d6bbf3ad805ccd4edafef6090f28
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
set -x; while [ $(curl --insecure -sw '%{http_code}' "
" -o /dev/null) -ne 200 ] ; do
echo "waiting for apiserver" ;
sleep 5 ;
done; while [[ $(curl --insecure -sw '%{http_code}' "
" -o /dev/null) =~ 403|405 ]] ; do
echo "waiting for fileserver" ;
sleep 5 ;
done; while [ $(curl --insecure -sw '%{http_code}' "
" -o /dev/null) -ne 200 ] ; do
echo "waiting for webserver" ;
sleep 5 ;
done
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 21 Jul 2025 15:23:03 +0000
Finished: Mon, 21 Jul 2025 15:23:03 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7f2zt (ro)
Containers:
k8s-glue:
Container ID:
Image: docker.io/allegroai/clearml-agent-k8s-base:1.24-21
Image ID: docker.io/allegroai/clearml-agent-k8s-base@sha256:772827a01bb5a4fff5941980634c8afa55d1d6bbf3ad805ccd4edafef6090f28
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
export PATH=$PATH:$HOME/bin; source /root/.bashrc && /root/entrypoint.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 21 Jul 2025 15:23:58 +0000
Finished: Mon, 21 Jul 2025 15:24:02 +0000
Ready: False
Restart Count: 3
Environment:
CLEARML_API_HOST:
CLEARML_WEB_HOST:
CLEARML_FILES_HOST:
CLEARML_API_HOST_VERIFY_CERT: false
K8S_GLUE_EXTRA_ARGS: --namespace clearml-prod --template-yaml /root/template/template.yaml --create-queue
CLEARML_CONFIG_FILE: /root/clearml.conf
K8S_DEFAULT_NAMESPACE: clearml-prod
CLEARML_API_ACCESS_KEY: <set to the key 'agentk8sglue_key' in secret 'clearml-agent-ac'> Optional: false
CLEARML_API_SECRET_KEY: <set to the key 'agentk8sglue_secret' in secret 'clearml-agent-ac'> Optional: false
CLEARML_WORKER_ID: clearml-agent
CLEARML_AGENT_UPDATE_REPO:
FORCE_CLEARML_AGENT_REPO:
CLEARML_DOCKER_IMAGE: ubuntu:18.04
K8S_GLUE_QUEUE: default
Mounts:
/root/clearml.conf from k8sagent-clearml-conf-volume (ro,path="clearml.conf")
/root/template from clearml-agent-pt (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7f2zt (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
clearml-agent-pt:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: clearml-agent-pt
Optional: false
k8sagent-clearml-conf-volume:
Type: Secret (a volume populated by a Secret)
SecretName: clearml-agent-ac
Optional: false
kube-api-access-7f2zt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 96s default-scheduler Successfully assigned clearml-prod/clearml-agent-848875fbdc-x8x6s to kharrinhao
Normal Pulled 95s kubelet Container image "docker.io/allegroai/clearml-agent-k8s-base:1.24-21" already present on machine
Normal Created 95s kubelet Created container: init-k8s-glue
Normal Started 95s kubelet Started container init-k8s-glue
Normal Pulled 40s (x4 over 94s) kubelet Container image "docker.io/allegroai/clearml-agent-k8s-base:1.24-21" already present on machine
Normal Created 40s (x4 over 94s) kubelet Created container: k8s-glue
Normal Started 40s (x4 over 93s) kubelet Started container k8s-glue
Warning BackOff 10s (x6 over 84s) kubelet Back-off restarting failed container k8s-glue in pod clearml-agent-848875fbdc-x8x6s_clearml-prod(42a51ff8-6423-485a-89e3-6109b3c0583a)
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 734, in _parse
flags = _parse_flags(source, state, char)
File "/usr/lib/python3.6/sre_parse.py", line 803, in _parse_flags
raise source.error("bad inline flags: cannot turn on global flag", 1)
sre_constants.error: bad inline flags: cannot turn on global flag at position 92 (line 4, column 20)
Sorry, we had a short delay on the deployment, but with these values:
clearml:
  agentk8sglueKey: "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
  agentk8sglueSecret: "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
  clearmlConfig: |-
    api {
      web_server:
      api_server:
      files_server:
      credentials {
        "access_key" = "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
        "secret_key" = "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
      }
    }
agentk8sglue:
  # Try different image versions to avoid Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "latest" # Use latest instead of specific version
    pullPolicy: Always
  # Essential server references
  apiServerUrlReference: "
  "
  fileServerUrlReference: "
  "
  webServerUrlReference: "
  "
  # Disable certificate checking
  clearmlcheckCertificate: false
  # Queue configuration
  queue: default
  createQueueIfNotExists: true
  # Minimal resources
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
sessions:
  svcType: NodePort
  externalIP: 192.168.70.211
  startingPort: 30100
  maxServices: 5
EOF
The following commands
helm repo add clearml
helm repo update
helm install clearml-agent clearml/clearml-agent \
--namespace clearml-prod \
--values clearml-agent-values.yaml \
--wait \
--timeout 300s
"clearml" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "argo" chart repository
...Successfully got an update from the "clearml" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: clearml-agent
LAST DEPLOYED: Mon Jul 21 15:11:38 2025
NAMESPACE: clearml-prod
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Glue Agent deployed.
Can you try with these values? The changes are: not using clearmlConfig, not overriding the image (using the default instead), and not defining resources.
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
I had those set in the config file, but I can provide what I am using for the server and agent config if it helps. I got lost in the configs, so I tried everything 🤣
So if you now run helm get values clearml-agent -n <NAMESPACE>, where <NAMESPACE> is the value you have in the $NS variable, can you confirm this is the full and only output? Of course, the $VARIABLES will have their real values.
agentk8sglue:
  # Try newer image version to fix Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "1.25-1"
    pullPolicy: Always
  apiServerUrlReference: "http://$NODE_IP:30008"
  fileServerUrlReference: "http://$NODE_IP:30081"
  webServerUrlReference: "http://$NODE_IP:30080"
  clearmlcheckCertificate: false
  queue: default
  createQueueIfNotExists: true
  # Keep resources minimal for testing
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
sessions:
  svcType: NodePort
  externalIP: $NODE_IP
  startingPort: 30100
  maxServices: 5
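For reference, a minimal sketch of rendering a values file with $NODE_IP placeholders like the one above, using strict substitution so that a missing or empty variable stops the run instead of ending up in the release. The helper name render_values and the shortened template are illustrative assumptions, not part of the chart or of Argo.

```python
from string import Template

# Shortened, hypothetical template reusing the placeholder style shown above.
VALUES_TEMPLATE = """\
agentk8sglue:
  apiServerUrlReference: "http://$NODE_IP:30008"
  fileServerUrlReference: "http://$NODE_IP:30081"
  webServerUrlReference: "http://$NODE_IP:30080"
sessions:
  externalIP: $NODE_IP
"""


def render_values(template_text, variables):
    """Substitute $NAME placeholders, refusing missing or empty values."""
    # A value of None would otherwise be rendered as the literal text "None".
    empty = [k for k, v in variables.items() if v is None or not str(v).strip()]
    if empty:
        raise ValueError("empty variables: " + ", ".join(sorted(empty)))
    # substitute() raises KeyError for a placeholder with no value,
    # unlike safe_substitute(), which would leave it in place.
    return Template(template_text).substitute(variables)


if __name__ == "__main__":
    print(render_values(VALUES_TEMPLATE, {"NODE_IP": "192.168.70.211"}))
```

Failing early here is a design choice: a broken URL reference is easier to spot before helm install than in a CrashLoopBackOff afterwards.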
Hi @<1811208768843681792:profile|BraveGrasshopper38> , following up on your last message, are you running in an OpenShift k8s cluster?
As far as I can tell, the server is running OK. I had some issues with resources not loading, but I solved those. The bigger issue for now is the agent, and it could probably propagate to serving. Later on I also plan to add GPU resources to both, so I'm not entirely sure about that part.
clearml-apiserver-866ccf75f7-zr5wx 1/1 Running 0 37m
clearml-apiserver-asyncdelete-8dfb574b8-8gbcv 1/1 Running 0 37m
clearml-elastic-master-0 1/1 Running 0 37m
clearml-fileserver-86b8ddf6f6-4xnqd 1/1 Running 0 37m
clearml-mongodb-5f995fbb5-xmdpb 1/1 Running 0 37m
clearml-redis-master-0 1/1 Running 0 37m
clearml-webserver-c487cfcb-vv5z5 1/1 Running 0 37m
If I run helm get values clearml-agent -n clearml-prod, the output is the following:
USER-SUPPLIED VALUES:
agentk8sglue:
  apiServerUrlReference: None
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference: None
  image:
    pullPolicy: Always
    repository: allegroai/clearml-agent-k8s-base
    tag: 1.25-1
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference: None
clearml:
  agentk8sglueKey: CLEARML8AGENT9KEY1234567890ABCD
  agentk8sglueSecret: CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456
  clearmlConfig: |-
    api {
      web_server: None
      api_server: None
      files_server: None
      credentials {
        "access_key" = "CLEARML8AGENT9KEY1234567890ABCD"
        "secret_key" = "CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456"
      }
    }
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
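The output above has None in every server URL field. A small sketch of how that could be caught automatically by scanning the text of helm get values before trusting a release; the function name find_unset_urls is hypothetical, the key list is taken from the values shown in this thread, and the line-based matching is a simplification rather than a real YAML parser.

```python
import re

# Keys that are expected to hold a URL (names as they appear in this thread).
URL_KEYS = (
    "apiServerUrlReference",
    "fileServerUrlReference",
    "webServerUrlReference",
    "web_server",
    "api_server",
    "files_server",
)

# Matches simple "key: value" lines; nested structure is ignored on purpose.
_LINE = re.compile(r"^\s*(?P<key>[A-Za-z_]\w*)\s*:\s*(?P<value>.*)$")


def find_unset_urls(values_text):
    """Return (line_number, key, value) for URL keys that are empty or None."""
    problems = []
    for number, line in enumerate(values_text.splitlines(), start=1):
        match = _LINE.match(line)
        if not match or match.group("key") not in URL_KEYS:
            continue
        value = match.group("value").strip().strip("\"'")
        if value in ("", "None", "null"):
            problems.append((number, match.group("key"), value))
    return problems


if __name__ == "__main__":
    sample = "agentk8sglue:\n  apiServerUrlReference: None\n  queue: default\n"
    for number, key, value in find_unset_urls(sample):
        print(f"line {number}: {key} is unset ({value!r})")
```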
with the values on helm
helm get values clearml-agent -n clearml-prod
USER-SUPPLIED VALUES:
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  image:
    pullPolicy: Always
    repository: allegroai/clearml-agent-k8s-base
    tag: latest
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU
  clearmlConfig: |-
    api {
      web_server:
      api_server:
      files_server:
      credentials {
        "access_key" = "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
        "secret_key" = "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
      }
    }
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
jcarvalho@kharrinhao:~$
parameters:
  - name: namespace
    value: clearml-prod
  - name: node-ip
    value: "192.168.70.211"
  - name: force-cleanup
    value: "false"
  - name: install-server
    value: "true"
  - name: install-agent
    value: "true"
  - name: install-serving
    value: "true"
  - name: diagnose-only
    value: "false"
  - name: storage-class
    value: openebs-hostpath
  - name: helm-timeout
    value: 900s
  - name: clearml-access-key
    value: CLEARML8AGENT9KEY1234567890ABCD
  - name: clearml-secret-key
    value: CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456
  - name: admin-password
    value: clearml123!
Yeah, I know. That's what I did for the GitHub implementation, but for this I need them to be generated on the fly, or via a CLI that I can have Argo call, if that's possible.
Hi, I'm trying to add the agent to a running server and I'm facing the same issue.
Defaulted container "k8s-glue" out of: k8s-glue, init-k8s-glue (init)
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/usr/lib/python3.6/sre_parse.py", line 734, in _parse
flags = _parse_flags(source, state, char)
File "/usr/lib/python3.6/sre_parse.py", line 803, in _parse_flags
raise source.error("bad inline flags: cannot turn on global flag", 1)
sre_constants.error: bad inline flags: cannot turn on global flag at position 92 (line 4, column 20)
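The traceback ends in Python 3.6's sre_parse while compiling a regular expression. As far as I know, Python 3.6 rejects the a, L and u flags inside a scoped group such as (?a:...), and the re documentation notes that from 3.7 those letters can also be used in a group, so the same pattern may compile on a newer interpreter. The logs do not show which pattern the agent is compiling, so the sketch below is only a way to test a candidate pattern on a given interpreter; the function name compile_error and the candidate pattern are hypothetical.

```python
import re
import sys


def compile_error(pattern, flags=0):
    """Return None if the pattern compiles, otherwise the error message."""
    try:
        re.compile(pattern, flags)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Hypothetical candidate: a scoped ASCII flag, which the re docs
    # describe as allowed inside a group from Python 3.7 onward.
    candidate = r"(?a:\w+)"
    print(sys.version_info[:2], compile_error(candidate))
```

Running it inside the agent image and on a local interpreter shows whether a pattern is rejected only by the older Python.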