With Helm we are not running in services mode. If the pod gets evicted or killed we should investigate the reason behind that; are there any logs on the killed pod that can help us understand the situation better?
Hi GrittyCormorant73
In the end everything goes through session.send, so you could add a print there.
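A rough sketch of what that could look like, assuming the internal Session.send method the message refers to (the exact module path and signature may differ between clearml versions):

```python
from clearml.backend_api.session import Session

# keep a reference to the original method and wrap it with a print
_original_send = Session.send

def _logged_send(self, req, *args, **kwargs):
    # log every outgoing request before it reaches the server
    print(f"[clearml] sending: {req}")
    return _original_send(self, req, *args, **kwargs)

Session.send = _logged_send
```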
btw: why would you print all the requests? what are we debugging here?
Do you have a link on how to set up a task scheduler to run in services mode in k8s?
basically spin up the agent pod and add an argument to the agent itself (this is the --services-mode flag)
https://clear.ml/docs/latest/docs/clearml_agent#services-mode
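For reference, the invocation described on that page is along the lines of: clearml-agent daemon --services-mode --queue services --create-queue --docker --cpu-only (queue name and docker flags depend on your setup).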
I think JuicyFox94 is maintaining the clearml helm charts. Can we specify services mode in the helm chart? Thank you
AgitatedDove14 thank you. I'll try that.
I deployed the agents using helm
especially if it's evicted, it should be due to increasing resource usage
I'm not sure how the helm chart is built, but do we have a "services queue" in the helm chart?
ok, but if you describe the pod you should at least see the termination cause
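e.g. kubectl describe pod <pod-name> and check the container's Last State / Reason and the Events section at the bottom.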
hi AgitatedDove14 do you mean I should insert code into the clearml package itself?
The long story is I tried to create a task scheduler, with my clearml agent running on k8s, so the scheduler runs as a pod. But I found out the pod can't run for very long. It may have been killed or evicted or something after a day or 2.
So I'm thinking I might need to create my own scheduler. I would like to know what it sends to the server to create the task/pipeline, so I can just replicate the HTTP API request instead of pulling all the code, installing the packages, and running the python code.
AgitatedDove14 oh, when I deploy the agents on k8s (using helm), I see them run reliably (not killed); are they running in services mode? Do you have a link on how to set up a task scheduler to run in services mode in k8s? Is it similar to the clearml agent? (From my understanding, the agent also listens to a queue and spins up a new pod to handle incoming tasks on the queue.)
JuicyFox94 I'll need to check with my infra team on that. When the pod gets killed, I can't access any logs on my Rancher end. On the clearml server, it simply shows the pod stopped communicating with the server. No error.
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a "services" pod running all these services together, with clearml-agent --services-mode. This pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a pod run the Task every X hours and spin it down
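If you stick with the built-in scheduler for that, a minimal sketch could look like the following (the task ID and queue names are placeholders, and the exact TaskScheduler signature should be checked against the docs for your version):

```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<template_task_id>",  # placeholder: the Task to clone and enqueue
    queue="default",                        # queue the cloned Task is pushed to
    hour=0, minute=30,                      # example schedule: daily at 00:30
)
# run the scheduler Task itself on the always-on services queue
scheduler.start_remotely(queue="services")
```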
So I would like to know what it sends to the server to create the task/pipeline, so I can just replicate the HTTP API request instead of pulling all the code, installing the packages, and running the python code.
Oh, I would just use the pythonic interface to do that, instead of the raw REST API
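Something along these lines, for example (project and task names are placeholders):

```python
from clearml import Task

# fetch an existing "template" task, clone it, and push the clone to a queue -
# essentially what the scheduler / raw REST calls do for you
template = Task.get_task(project_name="my_project", task_name="my_task")
cloned = Task.clone(source_task=template, name="my_task (scheduled run)")
Task.enqueue(cloned, queue_name="default")
```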
I see them run reliably (not killed); are they running in services mode?
How do you deploy agents, with the clearml k8s glue?