It's about strategy. If you have the ClearML server installed on k8s, I guess you want to run tasks on the same k8s cluster. In that case, using the latest clearml-agent chart is the way to go; it uses the glue agent under the hood. Basically what happens is the agent will spin up a new pod when a new task is enqueued in the related queue. At that point it's up to k8s to have enough resources to spawn the pod, and this can happen in two ways (install sketch below):
- you already have enough resources there
- you have a k8s autoscaler that can spin up new nodes when needed
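A minimal sketch of getting the agent chart running — the repo URL is the official one, but the value key names may differ across chart versions, so check the chart's values.yaml:

```bash
helm repo add clearml https://clearml.github.io/clearml-helm-charts
helm repo update
helm install clearml-agent clearml/clearml-agent \
  --namespace clearml \
  --set agentk8sglue.queue=default \
  --set clearml.agentk8sglueKey=<ACCESS_KEY> \
  --set clearml.agentk8sglueSecret=<SECRET_KEY>
```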
but you are starting from a major version that is really old, and where naming was potentially inconsistent for some time; so my suggestion is to back up EVERY PV before proceeding
and are you sure your mongodb respawned on the same node?
just my two cents
btw, a good practice is to keep infrastructural stuff decoupled from applications. What about using https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner ? After installing that chart you can simply use the generated storage class; wdyt?
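Something like this, following that chart's usual install flow (nfs.server/nfs.path are placeholders for your NFS export; the generated storage class is typically named nfs-client):

```bash
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=<NFS_SERVER_IP> \
  --set nfs.path=/exported/path
```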
probably you will see it's not capable of doing it, and it should be related to the k8s config
With Helm we are not running in services mode. If a pod gets evicted or killed we should investigate the reason behind that; are there any logs on the killed pod that can help us understand the situation better?
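For example (pod/namespace names are placeholders, adjust to your release):

```bash
kubectl get pods -n clearml                    # look for Evicted / OOMKilled / CrashLoopBackOff
kubectl describe pod <pod-name> -n clearml     # the Events section usually shows the kill/eviction reason
kubectl logs <pod-name> -n clearml --previous  # logs from the previous (killed) container instance
```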
yes, exactly, agent creates and manages task pod lifecycle
ok got it, are you able to access the system bypassing nginx with http://<Server Address>:8080 ?
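If direct access is blocked, you can also port-forward straight to the webserver service (service name assumed from the default chart release, check kubectl get svc):

```bash
kubectl port-forward -n clearml svc/clearml-webserver 8080:80
# then open http://localhost:8080
```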
it would be great to get logs from the apiserver and fileserver pods when deleting a file from the UI so we can see what is going on. I'm saying this because, at first glance, I don't see any issue in your config
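Something like this while you trigger the delete from the UI (deployment names assumed from the default chart release):

```bash
kubectl logs -n clearml -f deployment/clearml-apiserver
kubectl logs -n clearml -f deployment/clearml-fileserver
```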
Today I'm OOO but I can give an initial suggestion: when dealing with resource usage issues, logs are important, but metrics can help a lot more. If you don't have it, install a Grafana stack so we can see the resource metric history from before the OOM happened. This helps to understand if we are really using a lot of RAM or the problem is somewhere else.
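A quick way to get such a stack, assuming you go with the community kube-prometheus-stack chart (release/namespace names here are just examples):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
# Grafana comes bundled; reach it locally with a port-forward
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
```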
(and each queue has its own basePodTemplate; sketch below)
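A sketch of what I mean in the clearml-agent chart values — the exact keys (queues/templateOverrides) may differ between chart versions, so double-check its values.yaml:

```bash
cat > agent-values.yaml <<'EOF'
agentk8sglue:
  createQueues: true
  queues:
    cpuQueue:
      templateOverrides:
        resources:
          limits:
            cpu: "4"
    gpuQueue:
      templateOverrides:
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
helm upgrade clearml-agent clearml/clearml-agent -n clearml -f agent-values.yaml
```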
if you do a kubectl get svc in the namespace you should see the svc of the apiserver, webserver and fileserver
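e.g. (names and IPs depend on your release name; this is just what the default chart usually produces):

```bash
kubectl get svc -n clearml
# NAME                 TYPE        CLUSTER-IP     PORT(S)
# clearml-apiserver    ClusterIP   10.96.12.34    8008/TCP
# clearml-fileserver   ClusterIP   10.96.12.35    8081/TCP
# clearml-webserver    ClusterIP   10.96.12.36    80/TCP
```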
accessing apiserver from a pod doesn’t require kubeconfig
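you can verify it from inside the cluster with a throwaway pod — the service name and the debug.ping endpoint here are assumptions based on the default chart/apiserver, adjust as needed:

```bash
kubectl run curl-test -n clearml --rm -it --image=curlimages/curl -- \
  curl -s http://clearml-apiserver:8008/debug.ping
```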
there are workarounds tbh, but they are tricks that require a lot of k8s expertise and they are risky
at that point we define a queue and the agents will take care of training 😄
if it doesn't get updated so that CI passes, I will have to create a new one when possible, but I don't have a timeframe for now
you will need to upgrade the clearml helm chart
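the usual flow, assuming the official repo is already added (pinning a version gives you a more controlled jump):

```bash
helm repo update
helm upgrade clearml clearml/clearml -n clearml
# or: helm upgrade clearml clearml/clearml -n clearml --version <chart-version>
```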
you can create a specific config like one in https://clear.ml/docs/latest/docs/integrations/storage/
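for example, an S3 section in clearml.conf could look like this (all values are placeholders; the docs page linked above has the full reference):

```bash
cat >> ~/clearml.conf <<'EOF'
sdk {
  aws {
    s3 {
      region: "us-east-1"
      key: "<ACCESS_KEY>"
      secret: "<SECRET_KEY>"
    }
  }
}
EOF
```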
Ty, I have other stuff that I'd like to send, but it's better to get these merged first so I can proceed with shiny new PRs in the near future 😄
I absolutely need to improve the persistence part of this chart 😄
Hi everyone, I just fixed releases so new charts containing this fix are published. ty!