In this case I apologize for the confusion. If you are going for the AWS autoscaler it's better to follow the official way; the solution I proposed is for an on-premise cluster containing every component, without the autoscaler. Sorry for the confusion.
moreover, if you are using Minikube you can try the official Helm chart: https://github.com/allegroai/clearml-server-helm
in values.yaml I guess apiServerUrlReference is wrong
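Roughly, it should point at the apiserver URL; a minimal sketch, assuming the key sits under the clearml section of values.yaml (the exact key path and service name depend on your chart version and release name):

```yaml
# hypothetical excerpt from values.yaml -- key path and service name are assumptions
clearml:
  # should point at the ClearML apiserver, e.g. the in-cluster service on port 8008
  apiServerUrlReference: "http://clearml-apiserver:8008"
```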
Hi @<1580005316538404864:profile|StickyOwl15>, using EKS should be straightforward even with the standard values; you will eventually have to play a little with ingress annotations since AWS has a lot of them. The very important part is preparing your cluster before installing ClearML; usually you need to install the ELB controller, secret manager, DNS controller and so on. When that's done you can install ClearML with the Helm chart; about secrets management, it depends a lot on how you plan to manage c...
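For illustration, this is the kind of ALB ingress annotation set you typically end up adding; the block layout below is an assumption (check the chart's values for the actual structure), the annotations themselves are standard AWS Load Balancer Controller ones:

```yaml
# hypothetical ingress section -- structure is an assumption, hostname is a placeholder
ingress:
  enabled: true
  hostName: "api.clearml.example.com"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
```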
it looks to me like the redis pod is not working as expected, but it's just a guess
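something like this should show it (assuming the default clearml namespace; adjust the label selector to your release):

```bash
# check the redis pod status and recent logs (namespace and label are assumptions)
kubectl get pods -n clearml -l app.kubernetes.io/name=redis
kubectl logs -n clearml -l app.kubernetes.io/name=redis --tail=50
```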
it’s also true I never use local storage, since it’s not reliable for production
Ok, let’s try to deep dive into it: what Helm chart version was used for this deployment?
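if you’re not sure, helm can report it (assuming the release lives in the clearml namespace):

```bash
# list releases and their chart/app versions in the clearml namespace
helm list -n clearml
```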
it’s weird, can you pls open a bug in clearml-helm-charts repo?
ReassuredArcticwolf33 pls try this one and let me know if this is working for you
just a couple of questions
what kind of ClearML installation did you do on the machine? Are there processes listening on these ports?
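for example, you can check on the machine with something like this (the port list assumes the default ClearML server ports):

```bash
# show which processes listen on the default ClearML ports (8080 web, 8008 api, 8081 fileserver)
sudo ss -tlnp | grep -E ':(8080|8008|8081)\b'
```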
I suggest exec'ing into the pod and issuing the command kubectl delete pod -l=CLEARML=agent-74b23a8f --namespace=clearml --field-selector=status.phase!=Pending,status.phase!=Running --output name
so you can see the output from inside the pod. This should help understand what is going on with the command
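something along these lines (the pod name below is just a placeholder for your agent pod):

```bash
# open a shell inside the agent pod (pod name is a placeholder)
kubectl exec -it <agent-pod-name> -n clearml -- /bin/bash

# then, from inside the pod, run the same command and watch its output
kubectl delete pod -l=CLEARML=agent-74b23a8f --namespace=clearml \
  --field-selector=status.phase!=Pending,status.phase!=Running --output name
```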
I need to evaluate a better way to handle image versioning in the future; btw the latest version should be fixed now. Apologies folks!
AgitatedDove14 trainsConfig is totally optional and you can put the config file itself in it, e.g.:
```yaml
trainsConfig: |-
  sdk {
    aws {
      s3 {
        key: ""
        secret: ""
        region: ""
        credentials: [
          {
            host: "minio.minio:9000"
            key: "DEMOaccessKey"
            secret: "DEMOsecretKey"
            ...
```
how your cluster reacts is about scaling infra as much as needed (Karpenter or any other cloud autoscaler should work)
iptables is used by Docker itself so you need to be careful when making modifications: https://docs.docker.com/network/packet-filtering-firewalls/
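if you do need custom rules, the usual pattern from the Docker docs is to put them in the DOCKER-USER chain instead of touching Docker's own chains; a sketch (the interface and subnet are placeholders):

```bash
# only allow the given subnet to reach published container ports; eth0 and 10.0.0.0/24 are placeholders
iptables -I DOCKER-USER -i eth0 ! -s 10.0.0.0/24 -j DROP
```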
it will be easier for me to reproduce
btw, judging from the screenshots the services are ok but the pods are not up; elastic, redis and mongodb in particular are Pending
so it means k8s didn’t schedule them for some reason, which you can find by describing these pods
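for example (assuming the default clearml namespace; the pod name is a placeholder):

```bash
# list pod states, then look at the scheduler events for a Pending pod
kubectl get pods -n clearml
kubectl describe pod <pending-pod-name> -n clearml | tail -n 20
```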
maybe this can cause the issue?
This is specific K8s infra management; usually I use Velero for backups
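as an illustration, with Velero already installed a namespace-level backup looks something like this (the namespace is an assumption):

```bash
# back up everything in the clearml namespace, including PV snapshots if a snapshot provider is configured
velero backup create clearml-backup --include-namespaces clearml
```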
or maybe default?
there are workarounds tbh, but they are tricks that require a lot of k8s expertise and they are risky
I guess apiServerUrlReference should be fixed
I’m not totally sure atm, but you can try to set the env var CLEARML_API_HOST_VERIFY_CERT="false"
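on Kubernetes that would end up as a container env entry; a sketch of the standard form (where exactly it goes in the chart values depends on your chart version):

```yaml
# standard Kubernetes container env entry; the surrounding values key is chart-dependent
env:
  - name: CLEARML_API_HOST_VERIFY_CERT
    value: "false"
```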
this is basic k8s management that is not strictly related to this chart. My suggestion is to have a default storageclass that will be able to provide the right pv/pvc for any deployment you are going to have on the cluster. I suggest starting from here: https://kubernetes.io/docs/concepts/storage/storage-classes/
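a minimal sketch of marking a StorageClass as the cluster default, assuming the AWS EBS CSI driver as the provisioner (swap in whatever provisioner your cluster actually uses):

```yaml
# example StorageClass marked as cluster default; the provisioner is an assumption (AWS EBS CSI)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```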
there’s a PR coming with example values: https://github.com/allegroai/clearml-helm-charts/pull/234