
Hi CostlyOstrich36. Would it also be possible to set those values through env vars?
Because I'm using the chart ( https://github.com/allegroai/clearml-helm-charts/blob/06070a5c20691aaf83fc919b1bf07a822c212d5a/charts/clearml/values.yaml#L330 ) on Kubernetes, and so far I can only configure it through env variables.
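For reference, this is roughly how I'm passing env vars now (just a sketch; the `extraEnvs` key is an assumption and differs between chart versions, but `CLEARML_AGENT_GIT_USER`/`CLEARML_AGENT_GIT_PASS` are standard clearml-agent variables):
```yaml
# values.yaml (sketch; key names may differ per chart version)
k8sagent:
  extraEnvs:
    - name: CLEARML_AGENT_GIT_USER   # picked up by clearml-agent at startup
      value: "my-git-user"           # placeholder
    - name: CLEARML_AGENT_GIT_PASS
      valueFrom:
        secretKeyRef:
          name: clearml-git-credentials  # placeholder secret name
          key: password
```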
If you already have GPU autoscaling nodes in your k8s cluster, you could also give the k8s glue agent a go: https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300
With the correct tolerations/nodeSelectors you can have k8s take care of the autoscaling for you by just spinning up a new pod.
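For example, something like this in your values could pin the agent's pods to GPU nodes (a sketch; the taint and label names here are assumptions, match them to your cluster):
```yaml
k8sagent:
  podTemplate:
    nodeSelector:
      accelerator: nvidia    # assumption: however your GPU nodes are labeled
    tolerations:
      - key: nvidia.com/gpu  # common taint on GPU node pools
        operator: Exists
        effect: NoSchedule
```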
Worked like a charm, thanks SuccessfulKoala55 !!! 😄
Nice!
One small remark: the open-source contributions aren't mentioned in the release notes 😇
Hi Luca. We have ClearML deployed through ArgoCD and have the following configs:
Chart.yaml
```yaml
apiVersion: v2
name: clearml
description: A Helm chart for Kubernetes
version: 0.0.1
dependencies:
- name: clearml
  version: 3.5.1
  repository: https://allegroai.github.io/clearml-helm-charts
```
values.dev.yaml
```yaml
# insert your own config here
```
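One thing to keep in mind with this setup: since clearml is pulled in as a dependency, everything in values.dev.yaml has to be nested under the dependency name, e.g. (the override itself is just a placeholder):
```yaml
clearml:            # must match the dependency name in Chart.yaml
  k8sagent:
    enabled: true   # placeholder override
```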
Both of above files are pushed into our own private gitops repository.
Apply the following file with kubectl:
clearml-argocd.yaml
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
...
```
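For reference, a minimal sketch of how the rest of ours looks (repo URL, paths, and namespaces are placeholders for your own setup):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clearml
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops.git  # placeholder: your private gitops repo
    targetRevision: main
    path: clearml                                    # folder containing Chart.yaml
    helm:
      valueFiles:
        - values.dev.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: clearml
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```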
No problem! Thank you for finding a bug in the chart 🤓
I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!
Should have been `tolerations: []`, I'll send a PR soon to fix it.
In the meantime you can solve it by setting the value to `k8sagent.podTemplate.tolerations: []`.
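So in your values file the workaround looks like:
```yaml
k8sagent:
  podTemplate:
    tolerations: []   # override the chart's broken default until the PR lands
```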
Ah I see it! I made a mistake in the helm chart 🙈
Could be the cause of your error
You'll probably run into issues as soon as you want to start running experiments from private repos.
Because if not, the k8sagent pod is still using the old version
Do not go with the AWS managed Mongo & ES, both will not work I'm afraid and are a pain to set up, speaking from experience.
Oh btw, did you restart the k8sagent pod after applying the new template?
SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!
Thanks for your swift response CostlyOstrich36 !
We're a startup where about 10 people will use ClearML as the experiment logging backend with agents running on 4 on-prem GPU machines. We strive to always have experiments running to not have idle GPUs but this isn't always the case.
Alright, so the Redis instance isn't too mission-critical (I'll probably deploy this with the helm chart). The mongo and elastic are necessary, and I'd like to deploy these as managed instances in AWS. Do you have ...
https://github.com/elastic/elasticsearch-py/issues/1666
And you'll run into this ☝