yes, happy to help! In fact I am also interested in the k8s glue, since in one of our use cases we are using jobs and not pods (to allow for spot instances in the cloud), but I need to dig deeper into the architecture to understand what we need exactly from k8s glue.
SuccessfulKoala55 20 minutes at least
and the experiment did not produce any logs, shall I enable some debug flag?
apiserver logs were clean, only 200s there
ok, there is probably a problem on my side, because when I ran the sample code from the repo it works, sorry to bother you
yes, but the local output was completely empty
AgitatedDove14 I meant the following scenario:
trains-agents will be running as slurm jobs (possibly for a very long time). A program runs on an access node of the cluster (where no computation happens, but from where one can submit jobs to slurm); this program checks whether there are too few or too many agents running and adjusts the count by cancelling agents or spinning up new ones.
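A minimal sketch of that access-node scaling loop. `squeue` is a real slurm command, but the `--name` filter value, the target count, and the `plan_scaling` helper are all assumptions for illustration, not part of any existing glue code:

```python
import subprocess

TARGET_AGENTS = 4  # desired number of trains-agent slurm jobs (assumed)


def plan_scaling(running: int, idle: int, target: int) -> int:
    """Return how many agents to start (>0) or cancel (<0).

    Only idle agents are ever cancelled, so the negative return
    value is capped at the number of idle daemons.
    """
    delta = target - running
    if delta < 0:
        return -min(-delta, idle)  # cancel at most the idle ones
    return delta


def count_agent_jobs() -> int:
    """Count slurm jobs submitted under an assumed job name."""
    out = subprocess.run(
        ["squeue", "--name=trains-agent", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())
```

The point of splitting `plan_scaling` out is that the decision logic stays testable without a slurm cluster; only `count_agent_jobs` (and the eventual `sbatch`/`scancel` calls) touch slurm itself.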
that was quick, thanks!
AgitatedDove14 I do not want to push you in any way, but if you could give me a time estimate for the slurm glue code, that would be helpful. I should have a local installation of the trains server to experiment with next week.
sure, we can deal with the drivers
I will only cancel daemons which are idle.