The value can be DEBUG, INFO, WARNING, ERROR, or CRITICAL.
@<1734020208089108480:profile|WickedHare16> - please try configuring the cookieDomain
clearml:
  cookieDomain: ""
You should set it to your base domain, for example pixis.internal, without any api or files prefix in front of it.
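For example, with the base domain above, the override would look like this (a minimal sketch, assuming pixis.internal really is your base domain):
clearml:
  cookieDomain: "pixis.internal"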
Do you mean the Python version that is installed on the clearml agent itself? Or do you mean the Python version available in tasks that will be run from the agent?
For the Task Pods running your experiments through the agent, you can change the base image to anything you like that has the Python version you need. You can use this section of the values:
agentk8sglue:
  # -- default container image for ClearML Task pod
  defaultContainerImage: ubuntu:18.04 # <-- Change me!!
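For instance, a hypothetical override to get a newer Python in Task Pods (any image carrying the interpreter you need will do):
agentk8sglue:
  # image tag is just an example, pick whatever matches the Python version you want in Task Pods
  defaultContainerImage: python:3.10-slim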
Hey @<1734020208089108480:profile|WickedHare16> - Not 100% sure this is the issue, but I noticed a wrong configuration in your values.
You configured both these:
elasticsearch:
  enabled: true
externalServices:
  # -- Existing ElasticSearch connectionstring if elasticsearch.enabled is false (example in values.yaml)
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200},{\"host\":\"es_hostname2\",\"port\":9200},{\"host\":\"es_hostname3\",\"port\":9200}]"
Pl...
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
When you say
one of the api backend is UNHEALTHY
do you mean you have multiple replicas of the apiserver component (i.e. you set the value apiserver.replicaCount > 1) and one of them is not ready?
Could you please share the output of the kubectl describe command for the ClearML apiserver Deployment?
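For example (the Deployment name and namespace here are assumptions, adjust them to your release):
kubectl describe deployment clearml-apiserver -n clearml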
Hey @<1743079861380976640:profile|HighKitten20> - Try to configure this section in the values override file for the Agent helm chart:
# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd...
Hey @<1734020156465614848:profile|ClearKitten90> - You can try with the following in your ClearML Agent override helm values. Make sure to replace mygitusername and git-password with your own values:
agentk8sglue:
  basePodTemplate:
    env:
      # to setup access to private repo, setup secret with git credentials
      - name: CLEARML_AGENT_GIT_USER
        value: mygitusername
      - name: CLEARML_AGENT_GIT_PASS
        valueFrom:
          secretKeyRef:
            name: git-password
            ...
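A sketch of creating that secret, assuming the key inside it is also called git-password (the key name is cut off above, adjust it to whatever you reference in secretKeyRef):
kubectl create secret generic git-password \
  --from-literal=git-password='<YOUR_GIT_PASSWORD_OR_TOKEN>' \
  -n <AGENT_NAMESPACE>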
Hey @<1734020208089108480:profile|WickedHare16> , could you please share your override values file for the clearml helm chart?
Hello @<1523708147405950976:profile|AntsyElk37> 🙂
You are right, the spec.runtimeClassName field is not supported in the Agent at the moment, I'll work on your Pull Request ASAP.
Could you elaborate a bit on why you need Task Pods to specify the runtimeClass to use GPUs?
Usually, you'd need to specify a Pod's container with, for example, resources.limits.nvidia.com/gpu: 1, and the Nvidia Device Plugin would itself assign the correct device to the container. Will that work?
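For reference, a minimal sketch of requesting a GPU for Task Pods through the Agent values (assuming the basePodTemplate.resources field of the chart):
agentk8sglue:
  basePodTemplate:
    resources:
      limits:
        # one GPU per Task Pod; the Device Plugin picks the actual device
        nvidia.com/gpu: 1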
Hi @<1523708147405950976:profile|AntsyElk37> - There's a few points missing for the PR to be completed, let's follow-up on GitHub. See my comments here None
Hi @<1523708147405950976:profile|AntsyElk37> - Yes, having the runtimeClass makes sense. I am handling your PR soon 🙂
Hi @<1811208768843681792:profile|BraveGrasshopper38> , following up on your last message, are you running in an OpenShift k8s cluster?
So if you now run helm get values clearml-agent -n <NAMESPACE> (where <NAMESPACE> is the value you have in the $NS variable), can you confirm this is the full and only output? Of course the $VARIABLES will have their real values:
agentk8sglue:
  # Try newer image version to fix Python 3.6 regex issue
  image:
    repository: allegroai/clearml-agent-k8s-base
    tag: "1.25-1"
    pullPolicy: Always
  apiServerUrlReference: "http://$NODE_IP:30008"
  fileServerUrlReference: "ht...
Can you try with these values? In short, the changes are: not using clearmlConfig, not overriding the image (using the default), and not defining resources
agentk8sglue:
  apiServerUrlReference:
  clearmlcheckCertificate: false
  createQueueIfNotExists: true
  fileServerUrlReference:
  queue: default
  webServerUrlReference:
clearml:
  agentk8sglueKey: 8888TMDLWYY7ZQJJ0I7R2X2RSP8XFT
  agentk8sglueSecret: oNODbBkDGhcDscTENQyr-GM0cE8IO7xmpaPdqyfsfaWear...
Hi @<1523701907598610432:profile|ReassuredArcticwolf33> - Are you referring to the clearml helm chart or to the clearml-agent one?
Either way, the respective values.yaml file is self-documented and contains examples. Here is an example of adding additional volumes and volume mounts to the apiserver component of the clearml chart:
apiserver:
  # -- Defines extra Kubernetes volumes to be attached to the pod.
  additionalVolumes:
    - name: ramdisk
      empty...
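A fuller sketch of that example (the emptyDir settings and mount path are just illustrative, and additionalVolumeMounts is assumed to be the matching field for the mounts):
apiserver:
  additionalVolumes:
    - name: ramdisk
      emptyDir:
        medium: Memory
  additionalVolumeMounts:
    - name: ramdisk
      mountPath: /var/ramdisk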
Hi @<1798162812862730240:profile|PreciousCentipede43> 🙂
- Regarding bypassing the IAP I am not sure. Could you elaborate a bit? Do you have some expected solution in mind?
- For exposing the interactive sessions you can use a LoadBalancer config as mentioned (if your cloud provider supports its configuration) or use a NodePort service type (making sure there is no firewall rules and you can access the defined ports on the Nodes). Exposing the sessions through an Ingress is supported in t...
Hey @<1649221394904387584:profile|RattySparrow90> - You can try configuring CLEARML__logging__root__level as an extraEnvs entry for the apiserver and fileserver 🙂
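Something along these lines (a sketch; DEBUG is just an example level, and the fileserver is assumed to support the same extraEnvs field):
apiserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"
fileserver:
  extraEnvs:
    - name: CLEARML__logging__root__level
      value: "DEBUG"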
It's a bit hard for me to provide support here with the additional layer of Argo.
I assume the server is working fine and you can open the ClearML UI and log in, right? If yes, would it be possible to extract the Agent part only, out of Argo, and install it through the standard Helm chart?
Hey @<1523701304709353472:profile|OddShrimp85> - You can tweak the following section in the clearml-agent override values:
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"

# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: true # <-- Set this to true
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd
  # -- ...
Oh no worries, I understand 😄
Sure, if you could share the whole values and configs you're using to run both the server and agent that would be useful.
Also what about other Pods from the ClearML server, are there any other crash or similar error referring to a read-only filesystem? Are the server and agent installed on the same K8s node?
I understand, I'd just like to make sure that's the root issue and there's no other bug; if so, you can then think about how to automate it via the API.
Hi @<1843461294267568128:profile|KindArcticwolf58> - How did you execute this task?
The k8s_scheduler queue is an internal queue, not intended to be used for enqueuing tasks. I see you have configured the Agent to watch the gpu queue. Please make sure to create a queue with the same name on the control plane from the UI and restart the Agent, then enqueue the Task on this queue.
@<1710827340621156352:profile|HungryFrog27> have you installed the Nvidia gpu-operator to advertise GPUs to Kubernetes?
@<1752864322440138752:profile|GiddyDragonfly90> - I think you can also add verify_certs: false in the same elasticsearchConnectionString object, have you tried?
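Something like this, as a sketch (reusing the host object format from your connection string; only the first host is shown):
externalServices:
  # verify_certs added to each host object of the connection string
  elasticsearchConnectionString: "[{\"host\":\"es_hostname1\",\"port\":9200,\"verify_certs\":false}]"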
I see, in the example you provided you used a comma , to separate username and password; I suggest trying a colon : instead 🙂 let me know if that works for you
If that doesn't work, try removing the auth from the connection string and instead define two extraEnvs for the apiserver:
apiserver:
  extraEnvs:
    - name: CLEARML_ELASTIC_SERVICE_USERNAME
      value: "elastic"
    - name: CLEARML_ELASTIC_SERVICE_PASSWORD
      value: "toto"
@<1752864322440138752:profile|GiddyDragonfly90> - MongoDB is used as a dependency Helm Chart from the Bitnami repo. We are using version 12.1.31 of the chart. See this tag None
In the clearml override values, under the mongodb section, you can specify any value that is usable in the original chart 🙂
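For instance (the specific key here is just an illustrative Bitnami chart value, not a required setting):
mongodb:
  persistence:
    # any value from the Bitnami MongoDB chart (12.1.31) works here; this size is just an example
    size: 50Gi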