Thanks for your swift response CostlyOstrich36!
We're a startup where about 10 people will use ClearML as the experiment logging backend, with agents running on 4 on-prem GPU machines. We strive to always have experiments running so the GPUs don't sit idle, but this isn't always the case.
Alright, so the Redis instance isn't too mission-critical (I'll probably deploy this with the helm chart). The Mongo and Elastic are necessary and I'd like to deploy these as managed instances in AWS. Do you have ...
Hi Luca. We have ClearML deployed through ArgoCD and have the following configs:
Chart.yaml
` apiVersion: v2
name: clearml
description: A Helm chart for Kubernetes
version: 0.0.1
dependencies:
  - name: clearml
    version: 3.5.1
    repository: https://allegroai.github.io/clearml-helm-charts `
values.dev.yaml
` # insert your own config here `
Both of the above files are pushed to our own private GitOps repository.
Apply the following file with kubectl:
clearml-argocd.yaml
` apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
... `
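To round this out, here is a minimal sketch of what the full Application manifest can look like; the repo URL, path, and namespaces are placeholders for your own GitOps setup (the ArgoCD fields themselves are standard):
` apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: clearml
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/gitops.git  # placeholder: your private gitops repo
    path: clearml                                    # folder holding Chart.yaml + values.dev.yaml
    targetRevision: main
    helm:
      valueFiles:
        - values.dev.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: clearml
  syncPolicy:
    automated: {}                                    # let ArgoCD sync automatically `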
Hi CostlyOstrich36. Would it also be possible to set those values through env vars?
Because I am using the Helm chart ( https://github.com/allegroai/clearml-helm-charts/blob/06070a5c20691aaf83fc919b1bf07a822c212d5a/charts/clearml/values.yaml#L330 ) on Kubernetes and thus far can only configure it through env variables
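For what it's worth, a sketch of what injecting such env vars through the chart can look like. The `CLEARML_*_SERVICE_HOST` variables are ones the clearml-server images read for their backends; the `extraEnvs` field name is an assumption here, so check the chart's values.yaml for the exact key:
` apiserver:
  extraEnvs:
    # Point the apiserver at external backends (hostnames are placeholders)
    - name: CLEARML_MONGODB_SERVICE_HOST
      value: "my-mongo.internal"
    - name: CLEARML_ELASTIC_SERVICE_HOST
      value: "my-elastic.internal" `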
If you already have GPU autoscaling nodes in your k8s cluster, you could also give the k8s glue agent a go: https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300
With the correct tolerations/nodeSelectors you can have k8s take care of the autoscaling for you by just spinning up a new pod; see the sketch below.
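For illustration, a sketch of what that could look like in the values file. The `k8sagent.podTemplate` section matches the keys mentioned later in this thread, but the node label and taint key are assumptions about your GPU node pool:
` k8sagent:
  podTemplate:
    # Only schedule agent pods on GPU nodes (label is a placeholder)
    nodeSelector:
      accelerator: nvidia-gpu
    # Tolerate the taint your GPU/autoscaler nodes carry (assumed taint key)
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    # Requesting a GPU makes the cluster autoscaler add a GPU node when none is free
    resources:
      limits:
        nvidia.com/gpu: 1 `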
Could be the cause of your error.
You'll probably run into issues as soon as you want to start running experiments from private repos.
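For reference, clearml-agent picks up git credentials from the `CLEARML_AGENT_GIT_USER` / `CLEARML_AGENT_GIT_PASS` environment variables; how you inject them depends on the chart, so this snippet assumes an `extraEnvs`-style field and a hypothetical Secret:
` k8sagent:
  extraEnvs:
    - name: CLEARML_AGENT_GIT_USER
      value: "deploy-token-user"        # illustrative
    - name: CLEARML_AGENT_GIT_PASS
      valueFrom:
        secretKeyRef:
          name: clearml-git-credentials # hypothetical Secret
          key: password `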
Ah I see it! I made a mistake in the helm chart 🙈
No problem! Thank you for finding a bug in the chart 🤓
I have some other improvements to the k8sagent I want to submit a PR for soon, so be sure to monitor the chart repo for updates!
SmugHippopotamus96 the new version of the helm chart should fix all the issues you mentioned!
Oh btw, did you restart the k8sagent pod after applying the new template? Because if not, the k8sagent pod is still using the old version.
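(Something like `kubectl rollout restart deployment <k8sagent-deployment> -n <namespace>` should do it; the exact deployment name depends on your release, so check `kubectl get deployments` first.)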
Nice!
One small remark: the open-source contributions are not mentioned in the release notes 😇
https://github.com/elastic/elasticsearch-py/issues/1666
And you'll run into this ☝
Do not go with the AWS managed Mongo & ES; I'm afraid both will not work and are a pain to set up, speaking from experience.
Worked like a charm, thanks SuccessfulKoala55 !!! 😄
Should have been `tolerations: []`, I'll send a PR soon to fix it.
In the meantime you can solve it by setting `k8sagent.podTemplate.tolerations: []` in your values file.
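In a values file, that nests as:
` k8sagent:
  podTemplate:
    tolerations: [] `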