No, they were not SuccessfulKoala55
that was quick, thanks!
SuccessfulKoala55 20 minutes at least
and the experiment did not produce any logs, shall I enable some debug flag?
yes, but the local output was completely empty
apiserver logs were clean, only 200s there
thanks, next time I will provide you with all the logs
some piece of html+js code that you can add that governs how to visualize debug_samples from experiments that are already finished; think of adding an overlay of two types of images post factum
yes, this is what I found as well
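To make the overlay idea above a bit more concrete, here is a rough sketch from the SDK side rather than the UI: a small HTML page with inline JavaScript that blends two images, reported as media on the task. This assumes Logger.report_media is available in the trains/clearml SDK version in use and that the web UI renders the page in an iframe where the script may run; the project/task and image file names are placeholders.

```python
# Rough sketch only: assumes Logger.report_media exists in the installed trains SDK
# and that the uploaded HTML is rendered in an iframe where inline JS can run.
# image_a.png / image_b.png are placeholder names; in practice they would have to be
# uploaded alongside the page or inlined (e.g. as data URIs) for it to be self-contained.
from trains import Task

task = Task.init(project_name="examples", task_name="overlay-demo")

html = """
<html><body>
  <div style="position: relative; width: 512px; height: 512px;">
    <img src="image_a.png" style="position: absolute; top: 0; left: 0;">
    <img id="top" src="image_b.png" style="position: absolute; top: 0; left: 0; opacity: 0.5;">
  </div>
  <!-- slider controlling how strongly the second image overlays the first -->
  <input type="range" min="0" max="100" value="50"
         oninput="document.getElementById('top').style.opacity = this.value / 100;">
</body></html>
"""

with open("overlay.html", "w") as f:
    f.write(html)

task.get_logger().report_media(
    title="debug overlay", series="a vs b", iteration=0, local_path="overlay.html"
)
```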
AgitatedDove14 thanks for the additional information:
yes, the report_image problem was resolved after I reordered dimensions in the tensor. Is there an advantage in using tensorboard over your reporting? The html reporting looks powerful, can one inject some javascript inside?
AgitatedDove14 FYI: I am using pytorch
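In case it helps anyone hitting the same report_image issue: the fix was simply reordering the tensor dimensions before handing the image over. A minimal sketch, assuming the trains SDK and that report_image accepts an HxWxC uint8 numpy array via its image argument (project/task names and the tensor itself are placeholders):

```python
# Minimal sketch of the dimension reorder that fixed report_image.
# Assumes the trains SDK; passing a numpy array via `image=` may be `matrix=`
# in older SDK versions, so treat the argument name as something to verify.
import numpy as np
import torch
from trains import Task

task = Task.init(project_name="examples", task_name="report-image-demo")

chw = torch.rand(3, 224, 224)  # typical pytorch image tensor: channels first (CxHxW)

# image reporting expects channels last (HxWxC), hence the permute before .numpy()
hwc = (chw.permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)

task.get_logger().report_image(
    title="debug samples",
    series="after permute",
    iteration=0,
    image=hwc,
)
```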
For the images themselves, you can get their URLs
how can I do it?
Yes, replace image viewer with a custom widget, but perhaps we can implement this externally (I am not an expert on UI to be honest)
have some kind of an add-on, not as a widget but in an external system (this is not the preferred way of course)
Not sure yet, I will get back to you on this later, in 1-2 weeks, thanks.
AgitatedDove14 I meant the following scenario:
trains-agents will be running as slurm jobs (possibly for a very long time); there is a program running on an access node of the cluster (where no computation happens, but from where one can submit jobs to slurm); this program checks whether there are too few or too many agents running and adjusts their number by cancelling agents or spinning up new ones.
AgitatedDove14 Is there a way to say to a worker that it should not take new tasks? If there is such a feature then one could avoid the race condition.
yes, happy to help! In fact I am also interested in the k8s glue, since in one of our use cases we are using jobs and not pods (to allow for spot instances in the cloud), but I need to dig deeper into the architecture to understand what we need exactly from the k8s glue.
I hope you can do this without containers.
there was a problem with index order when converting from pytorch tensor to numpy array
Unfortunately there is no docker, there is only singularity. This cluster is used by many users and docker is not secure enough.
that's ok, I think that the race condition will be a non-issue. Thanks for checking!
AgitatedDove14 going back to the slurm subject, I have local trains installed on the cluster with slurm so I am ready to test. At the same time I was thinking whether a simple solution would do the job:
a) [scale up agents] monitor the trains queue; if there is a task that has not been started for some amount of time and the number of agents is not yet at the maximum, add an agent,
b) [scale down agents] if all the tasks are running and there are idle agents, kill an idle agent.
Or do yo...
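For reference, a rough sketch of the scale-up / scale-down loop from a) and b), running on the access node. The slurm side only uses sbatch / squeue / scancel; the trains-queue check is left as a stub because the exact server call depends on the setup, and the job name and batch script are placeholders. A real version would also have to make sure it only cancels idle agents (the race condition discussed above).

```python
# Sketch of the access-node autoscaler described in a) and b) above.
# Assumptions: a slurm batch script run_trains_agent.sbatch that starts
# `trains-agent daemon --queue default`, and a pending_tasks() helper that queries
# the trains server (left unimplemented; the exact API call depends on the setup).
import subprocess
import time

MAX_AGENTS = 8                             # upper bound on concurrently running agents
JOB_NAME = "trains-agent"                  # placeholder slurm job name for agent jobs
SBATCH_SCRIPT = "run_trains_agent.sbatch"  # placeholder batch script
POLL_SECONDS = 60


def pending_tasks() -> int:
    """Number of tasks waiting in the trains queue (stub: fill in with a query
    to the trains server, e.g. via its REST API)."""
    raise NotImplementedError


def agent_job_ids() -> list:
    """Slurm job ids of running or pending agent jobs."""
    out = subprocess.run(
        ["squeue", "--name", JOB_NAME, "--states", "RUNNING,PENDING",
         "--noheader", "--format", "%i"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def autoscale_once():
    waiting = pending_tasks()
    agents = agent_job_ids()
    if waiting > 0 and len(agents) < MAX_AGENTS:
        # a) scale up: queued work and room for another agent
        subprocess.run(["sbatch", "--job-name", JOB_NAME, SBATCH_SCRIPT], check=True)
    elif waiting == 0 and agents:
        # b) scale down: nothing queued, cancel one agent job
        # (a real version should check the agent is idle before cancelling it)
        subprocess.run(["scancel", agents[-1]], check=True)


if __name__ == "__main__":
    while True:
        autoscale_once()
        time.sleep(POLL_SECONDS)
```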