
the images do not show up in debug_samples. How can I check what is wrong?
since I am using the demo server, I should make sure that in the configuration file the images will be uploaded to the appropriate server, right? Can you please point me to the proper line in the config file?
ok, there is probably a problem on my side, because when I run the sample code from the repo it works, sorry to bother you
have some kind of add-on, not as a widget but in an external system (this is not the preferred way, of course)
that was quick, thanks!
No, they were not, SuccessfulKoala55
I will only cancel daemons which are idle.
AgitatedDove14 going back to the slurm subject, I have local trains installed on the cluster with slurm so I am ready to test. At the same time I was thinking whether a simple solution would do the job:
a) [scale up agents] monitor the trains queue, if there is something that was not started for some amount of time, and the number of agents is not yet at the maximum, then add an agent,
b) [scale down agents] if all the tasks are running and there are idle agents, kill an idle agent (a rough sketch of this loop is below).
Or do yo...
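Roughly what I have in mind for a) and b), as a minimal sketch — pending_tasks / idle_agent_jobs are placeholders for queries against the trains server, and run_trains_agent.sh is an sbatch script I would still have to write:
```python
import subprocess
import time

MAX_AGENTS = 8                               # upper limit on concurrent agents
AGENT_SBATCH_SCRIPT = "run_trains_agent.sh"  # assumed sbatch script that starts a trains-agent

def pending_tasks(queue_name):
    """Placeholder: number of tasks waiting in the given trains queue.
    In practice this would query the trains server."""
    raise NotImplementedError

def idle_agent_jobs():
    """Placeholder: SLURM job ids of agents that currently run no task."""
    raise NotImplementedError

def scale_once(running_agents):
    # a) scale up: work is waiting and we are below the agent limit
    if pending_tasks("default") > 0 and running_agents < MAX_AGENTS:
        subprocess.run(["sbatch", AGENT_SBATCH_SCRIPT], check=True)
        running_agents += 1
    # b) scale down: nothing is waiting, so idle agents can be cancelled
    elif pending_tasks("default") == 0:
        for job_id in idle_agent_jobs():
            subprocess.run(["scancel", str(job_id)], check=True)
            running_agents -= 1
    return running_agents

if __name__ == "__main__":
    agents = 0
    while True:
        agents = scale_once(agents)
        time.sleep(60)  # poll once a minute; a real version would also track
                        # how long a task has been waiting before scaling up
```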
AgitatedDove14 I meant the following scenario:
trains-agents will be running as slurm jobs (possibly for a very long time); there is a program running on an access node of the cluster (where no computation happens, but from where one can submit jobs to slurm); this program checks whether there are too few or too many agents running and adjusts the count by cancelling agents or spinning up new ones.
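Instead of a separate sbatch script, the program on the access node could also wrap the agent command directly. A sketch of what starting and stopping one long-lived agent job might look like (the SLURM resource flags are just examples, and the exact trains-agent flags would need to be double-checked):
```python
import subprocess

def start_agent(trains_queue="default", time_limit="7-00:00:00"):
    """Submit one trains-agent as a long-running SLURM job and return its job id."""
    cmd = [
        "sbatch",
        "--job-name=trains-agent",
        f"--time={time_limit}",      # let the agent live for up to a week
        "--cpus-per-task=4",
        "--wrap", f"trains-agent daemon --queue {trains_queue} --foreground",
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # sbatch prints "Submitted batch job <id>"; keep the id so we can scancel later
    return int(result.stdout.split()[-1])

def stop_agent(slurm_job_id):
    """Cancel the SLURM job that hosts an (idle) agent."""
    subprocess.run(["scancel", str(slurm_job_id)], check=True)
```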
yes, but the local output was completely empty
Unfortunately there is no docker, there is only singularity. This cluster is used by many users and docker is not secure enough.
that's ok, I think that the race condition will be a non-issue. Thanks for checking!
AgitatedDove14 Is there a way to say to a worker that it should not take new tasks? If there is such a feature then one could avoid the race condition.
But I can't do this from the web ui, can I?
I hope you can do this without containers.
SuccessfulKoala55 20 minutes at least
thanks, next time I will provide you with all the logs
and the experiment did not produce any logs, shall I enable some debug flag?
sure, we can deal with the drivers
yes, happy to help! In fact I am also interested in the k8s glue, since in one of our use cases we are using jobs and not pods (to allow for spot instances in the cloud), but I need to dig deeper into the architecture to understand what exactly we need from the k8s glue.
I implemented the first version and it seems to work.
apiserver logs were clean, only 200s there
so far everything works; the only problem I can think of is a race condition, which I will probably ignore, and which happens in the following scenario:
a) a worker finishes its current run, turns into an idle state,
b) my script scrapes the status of the worker, which is idle,
c) a new task is enqueued and picked by the worker,
d) the worker is killed after it managed to pull a task from the queue, so the task will be cancelled as well.
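For reference, the kind of double-check that would narrow (but not remove) that window before killing a worker; worker_is_idle is a placeholder for a query against the trains server:
```python
import subprocess
import time

def worker_is_idle(slurm_job_id):
    """Placeholder: ask the trains server whether the agent running under
    this SLURM job currently has no task assigned."""
    raise NotImplementedError

def cancel_if_still_idle(slurm_job_id, grace_seconds=30):
    """Only cancel a worker that stays idle across two checks, and re-check
    right before issuing scancel; this shrinks the race window from b)-d)
    but does not eliminate it."""
    if not worker_is_idle(slurm_job_id):
        return False
    time.sleep(grace_seconds)
    if not worker_is_idle(slurm_job_id):  # it picked up work in the meantime
        return False
    subprocess.run(["scancel", str(slurm_job_id)], check=True)
    return True
```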
AgitatedDove14 I need to finish working now, will be back in the evening and have a look.
AgitatedDove14 I looked at the K8s glue code, having something similar but for SLURM would be great!