this is a connection failure from the agent to the apiserver. the flow should be agent pod -> apiserver svc -> apiserver pod. maybe the apiserver also has something in its logs that can be checked
this is the state of the cluster https://github.com/valeriano-manassero/mlops-k8s-infra
especially if it’s evicted, it should be due to increasing resource usage
it would be great to get logs from the apiserver and fileserver pods when deleting a file from the UI so we can see what is going on. I’m saying this because, at first glance, I don’t see any issue in your config
Hi, not really sure if there is any problem with the GitHub CDN but it looks fine to me right now: https://github.com/allegroai/clearml-helm-charts/issues/155
so you are using docker-compose?
this one should not be needed for asyncdelete, what is the error you are getting?
it’s weird, can you pls open a bug in clearml-helm-charts repo?
ok so they are executed as expected
pretty strange, I also noticed in the example, line 2: `from clearml import TaskTypes`
ok so you are on chart major version 4 while we are now on 6. Let me check one minute pls
this should be the form that works on Helm
if they are in kubernetes you can simply use k8s glue
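For context, deploying the agent in k8s glue mode is usually done through the clearml-agent Helm chart; a minimal values sketch might look like this (the key names are illustrative and may differ across chart versions, so check the chart’s own values.yaml):

```yaml
# values.yaml sketch for the clearml-agent chart (illustrative key names;
# verify against the values.yaml of your chart version)
agentk8sglue:
  queue: default                 # ClearML queue the glue agent pulls tasks from
  apiServerUrlReference: "http://clearml-apiserver:8008"   # in-cluster apiserver svc
clearml:
  agentk8sglueKey: "<ACCESS_KEY>"      # credentials created in the ClearML UI
  agentk8sglueSecret: "<SECRET_KEY>"
```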
Today I’m OOO but I can give an initial suggestion: when dealing with resource usage issues, logs are important but metrics can help a lot more. If you don’t have it, install a Grafana stack so we can see the resource metric history before the OOM. This helps to understand if we are really using a lot of RAM or the problem is somewhere else.
in a few seconds it should become green
this is the PR: https://github.com/allegroai/clearml-helm-charts/pull/80 will merge it soon so agent chart 1.0.1 will be released
still need time because I have two very busy days
ReassuredArcticwolf33 pls try this one and let me know if this is working for you
and add these 3 hostnames, pointing them to the external IP
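For a quick local test, the three hostnames can be mapped in /etc/hosts; this sketch assumes the usual ClearML ingress hostnames (app/api/files) and uses a hypothetical domain and external IP, so substitute your own values:

```
# /etc/hosts sketch (hypothetical IP and domain; use your ingress external IP
# and the hostnames configured in your chart values)
203.0.113.10  app.clearml.example.com
203.0.113.10  api.clearml.example.com
203.0.113.10  files.clearml.example.com
```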
ok got it, are you able to access the system bypassing nginx with http://<Server Address>:8080 ?
I don’t think you need to pass these env vars in extraenvs, references are automatically generated by chart. After removing them, pls post webserver pod logs here and let’s see if we can spot the issue, ty.
Ok, I’d like to test it more with you; the credentials exposed in the chart values are system ones and it’s better not to change them; let’s forget about them for now. If you create a new access key/secret key pair in the UI, you should use those in your agents and they should not get overwritten in any way; can you confirm it works without touching the credentials section?
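One hedged way to hand the UI-generated pair to an agent is through extra environment variables; CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY are the standard ClearML agent variables, but whether the chart key is called extraEnvs depends on your chart version:

```yaml
# Illustrative values fragment: pass credentials created in the UI
# (Settings -> Workspace -> Create new credentials) to the agent.
extraEnvs:
  - name: CLEARML_API_ACCESS_KEY
    value: "<access key created in the UI>"
  - name: CLEARML_API_SECRET_KEY
    value: "<matching secret key>"
```

In a real deployment you would normally reference a Kubernetes Secret instead of inlining the values.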
i will release a new chart version with no need to set a default storage class like I asked you to do today
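Until that chart version lands, marking an existing StorageClass as the cluster default is done with the standard Kubernetes annotation; this manifest fragment assumes a StorageClass named `standard`, so adjust the name to whatever your cluster provides:

```yaml
# Mark an existing StorageClass as the cluster default
# (class name "standard" is an assumption; use your own)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
```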