AgitatedDove14 thanks, that will be helpful!
I implemented the first version and it seems to work.
AgitatedDove14 I need to finish working now, will be back in the evening and have a look.
But I can't do this from the web ui, can I?
So far everything works. The only problem I can think of is a race condition, which I will probably ignore. It happens in the following scenario (a rough sketch of my scrape-and-kill loop follows the list):
a) a worker finishes its current run and becomes idle,
b) my script scrapes the status of the worker, which is idle,
c) a new task is enqueued and picked by the worker,
d) the worker is killed after it managed to pull a task from the queue, so the task will be cancelled as well.
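For reference, a minimal sketch of the scrape-and-kill loop I have in mind; the `get_worker_status` and `cancel_daemon` hooks are placeholders I made up (e.g. wrappers around squeue/scancel on the SLURM side), not actual trains-agent calls:

```python
def scale_down(worker_ids, get_worker_status, cancel_daemon):
    """Kill agent daemons that currently report as idle.

    get_worker_status(worker_id) -> 'idle' | 'busy' and cancel_daemon(worker_id)
    are hypothetical hooks, not real trains-agent API calls.
    """
    for worker_id in worker_ids:
        # Race window: the worker may pull a task between this status check
        # and the kill below -- that is the scenario described in the list above.
        if get_worker_status(worker_id) == "idle":
            cancel_daemon(worker_id)
```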
that's ok, I think that the race condition will be a non-issue. Thanks for checking!
AgitatedDove14 I do not want to push you in any way, but if you could give me an estimate of the slurm glue code, that would be helpful. I should have a local installation of the trains server to experiment with next week.
AgitatedDove14 I looked at the K8s glue code, having something similar but for SLURM would be great!
thanks, next time I will provide you with all the logs
I will only cancel daemons which are idle.
No, they were not SuccessfulKoala55
yes, but the local output was completely empty
that was quick, thanks!
AgitatedDove14 if I use report_image can I get a URL to it somehow?
there was a problem with index order when converting from pytorch tensor to numpy array
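In case anyone hits the same thing, a minimal example of the reorder I mean, assuming the usual channels-first PyTorch layout:

```python
import numpy as np
import torch

# PyTorch image tensors are usually (C, H, W); image reporting generally
# expects (H, W, C), so permute before converting to numpy.
chw = torch.rand(3, 64, 64)                # channels-first tensor
hwc = chw.permute(1, 2, 0).cpu().numpy()   # channels-last numpy array
img = (hwc * 255).astype(np.uint8)         # uint8 image ready for reporting
```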
yes, happy to help! In fact I am also interested in the k8s glue, since in one of our use cases we are using jobs and not pods (to allow for spot instances in the cloud), but I need to dig deeper into the architecture to understand what exactly we need from the k8s glue.
I hope you can do this without containers.
Unfortunately there is no docker, there is only singularity. This cluster is used by many users and docker is not secure enough.
AgitatedDove14 FYI: I am using pytorch
apiserver logs were clean, only 200s there
and the experiment did not produce any logs, shall I enable some debug flag?
AgitatedDove14 going back to the slurm subject: I have a local trains install on the cluster with slurm, so I am ready to test. At the same time I was wondering whether a simple solution would do the job (a rough sketch follows below):
a) [scale up agents] monitor the trains queue; if a task has not been started for some amount of time and the number of agents is not yet at the maximum, add an agent,
b) [scale down agents] if all the tasks are running and there are idle agents, kill an idle agent.
Or do yo...
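A rough sketch of the policy, just to make the two rules concrete; the queue/agent inputs and the add/kill actions are hypothetical placeholders (wrappers around the trains server API and sbatch/scancel), not actual trains calls:

```python
import time

MAX_AGENTS = 8           # assumed upper bound on concurrent agents
PENDING_TIMEOUT = 300    # seconds a task may wait before we scale up


def autoscale_once(queue, agents, now=None):
    """One iteration of the simple scale-up / scale-down policy above.

    queue: list of (task_id, enqueue_time) for pending tasks.
    agents: dict {agent_id: 'idle' | 'busy'}.
    Returns the action to take, or None.
    """
    now = now if now is not None else time.time()

    # a) scale up: something has been waiting too long and we have headroom
    oldest_wait = max((now - t for _, t in queue), default=0)
    if oldest_wait > PENDING_TIMEOUT and len(agents) < MAX_AGENTS:
        return "add_agent"                  # e.g. sbatch a new agent daemon

    # b) scale down: nothing is waiting and some agents sit idle
    idle = [a for a, state in agents.items() if state == "idle"]
    if not queue and idle:
        return ("kill_agent", idle[0])      # e.g. scancel that agent's SLURM job

    return None
```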
SuccessfulKoala55 20 minutes at least
AgitatedDove14 thanks for the additional information:
yes, the report_image problem was resolved after I reordered dimensions in the tensor. Is there an advantage in using tensorboard over your reporting? HTML reporting looks powerful, can one inject some javascript inside?
sure, we can deal with the drivers