Hi @<1811208768843681792:profile|BraveGrasshopper38> , following up on your last message, are you running in an OpenShift k8s cluster?
Oh no worries, I understand 😄
Sure, if you could share the complete values and configs you're using to run both the server and the agent, that would be useful.
Also, what about the other Pods from the ClearML server: are there any other crashes or similar errors referring to a read-only filesystem? Are the server and agent installed on the same K8s node?
Please replace those credentials on the agent and try upgrading the Helm release.
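For reference, a minimal sketch of that upgrade step, assuming the release name `clearml-agent`, namespace `clearml-prod`, and values file name used later in this thread:

```shell
# Hypothetical: rotate the credentials in your values file first,
# then roll the new values out to the existing release.
helm upgrade clearml-agent clearml/clearml-agent \
  --namespace clearml-prod \
  --values clearml-agent-values.yaml
```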
`cat values-prod.yaml`

```yaml
agent:
  api:
    web_server: ""
    api_server: ""
    files_server: ""
    credentials:
      access_key: "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
      secret_key: "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
```
Also, to simplify the installation, can you use a simpler version of your values for now? Something like this should work:
```yaml
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference:
clearml:
  agentk8sglueKey: <NEW_KEY>
  agentk8sglueSecret: <NEW_SECRET>
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
```
Hi @<1857232027015712768:profile|PompousCrow47> , are you using pods with a read-only filesystem limitation?
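One quick way to check is to inspect the pod's security context; this is a sketch with an assumed namespace and a placeholder pod name (adjust both to your deployment):

```shell
# Prints each container's readOnlyRootFilesystem setting (empty if unset).
kubectl -n clearml-prod get pod <agent-pod-name> \
  -o jsonpath='{.spec.containers[*].securityContext.readOnlyRootFilesystem}'
```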
If I run `helm get values clearml-agent -n clearml-prod`, the output is the following:

```yaml
USER-SUPPLIED VALUES:
agentk8sglue:
  apiServerUrlReference: None
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference: None
  image:
    pullPolicy: Always
    repository: allegroai/clearml-agent-k8s-base
    tag: 1.25-1
  queue: default
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
  webServerUrlReference: None
clearml:
  agentk8sglueKey: CLEARML8AGENT9KEY1234567890ABCD
  agentk8sglueSecret: CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456
  clearmlConfig: |-
    api {
      web_server: None
      api_server: None
      files_server: None
      credentials {
        "access_key" = "CLEARML8AGENT9KEY1234567890ABCD"
        "secret_key" = "CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456"
      }
    }
sessions:
  externalIP: 192.168.70.211
  maxServices: 5
  startingPort: 30100
  svcType: NodePort
```
I will try to:
1. update the agent with these values
2. run Argo with those changes

For now:

```yaml
- name: clearml-access-key
  value: CLEARML8AGENT9KEY1234567890ABCD
- name: clearml-secret-key
  value: CLEARML-AGENT-SECRET-1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ123456
- name: admin-password
  value: clearml123!
```
Sorry, we had a short delay on the deployment, but with these values:

```yaml
clearml:
  agentk8sglueKey: "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
  agentk8sglueSecret: "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
  clearmlConfig: |-
    api {
      web_server:
      api_server:
      files_server:
      credentials {
        "access_key" = "8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT"
        "secret_key" = "oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWearo1S8EQ8eBOxu-opW8dVUU"
      }
    }
agentk8sglue:
  # Try different image versions to avoid Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "latest"  # Use latest instead of specific version
    pullPolicy: Always
  # Essential server references
  apiServerUrlReference: ""
  fileServerUrlReference: ""
  webServerUrlReference: ""
  # Disable certificate checking
  clearmlcheckCertificate: false
  # Queue configuration
  queue: default
  createQueueIfNotExists: true
  # Minimal resources
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi
sessions:
  svcType: NodePort
  externalIP: 192.168.70.211
  startingPort: 30100
  maxServices: 5
```
I ran the following commands:

```shell
helm repo add clearml
helm repo update
helm install clearml-agent clearml/clearml-agent \
  --namespace clearml-prod \
  --values clearml-agent-values.yaml \
  --wait \
  --timeout 300s
```
"clearml" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "argo" chart repository
...Successfully got an update from the "clearml" chart repository
...Successfully got an update from the "harbor" chart repository
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: clearml-agent
LAST DEPLOYED: Mon Jul 21 15:11:38 2025
NAMESPACE: clearml-prod
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Glue Agent deployed.
Just to check: is this the intended image, `docker.io/allegroai/clearml-agent-k8s-base:1.24-2`?
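One way to confirm which image the deployed pods are actually running (the namespace is taken from this thread; no label selector is used here so all pods in the namespace are listed):

```shell
# Prints each pod's name and its container image(s), one pod per line.
kubectl -n clearml-prod get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```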
So `CLEARML8AGENT9KEY1234567890ABCD` is the actual, real value you are using?
I also see these logs:

```
/root/entrypoint.sh: line 28: /root/clearml.conf: Read-only file system
```

This indicates that the container's filesystem is mounted as read-only, preventing the agent from writing its configuration file.
From:

```yaml
podSecurityContext:
  readOnlyRootFilesystem: true  # This causes the issue
```

This kind of restriction is typically enforced by:
- PodSecurityPolicies
- Security Context Constraints (OpenShift)
- Admission controllers enforcing read-only filesystems
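If the restriction comes from your own values rather than a cluster-side policy (e.g. an OpenShift SCC), a sketch of the relaxed setting; the exact key path depends on your chart's values schema, so verify it there before applying:

```yaml
# Hypothetical override: allow the entrypoint to write /root/clearml.conf.
# If the policy is enforced cluster-side, this alone will not help; you would
# need an SCC/admission policy that permits a writable root filesystem instead.
podSecurityContext:
  readOnlyRootFilesystem: false
```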