
ok, there is probably a problem on my side, because when I ran the sample code from the repo it works. Sorry to bother you.
AgitatedDove14 FYI: I am using PyTorch
there was a problem with the index order when converting from a PyTorch tensor to a numpy array
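For context, a minimal sketch of the fix (the exact report_image keyword for passing a numpy image may differ between trains versions, and the tensor shapes here are purely illustrative):

```python
import numpy as np
import torch
from trains import Logger, Task

task = Task.init(project_name="examples", task_name="debug sample reporting")

# PyTorch image tensors are usually CHW, while image reporting expects HWC,
# so permute the axes before converting to numpy.
img_chw = torch.rand(3, 224, 224)                 # illustrative tensor
img_hwc = img_chw.permute(1, 2, 0).cpu().numpy()  # CHW -> HWC
img_uint8 = (img_hwc * 255).astype(np.uint8)      # 0-255 range for display

Logger.current_logger().report_image(
    title="debug", series="sample", iteration=0, image=img_uint8
)
```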
apiserver logs were clean, only 200s there
the images do not show up in debug_samples. How can I check what is wrong?
and the experiment did not produce any logs, shall I enable some debug flag?
AgitatedDove14 if I use report_image
can I get a URL to it somehow?
yes, this is what I found as well
some piece of HTML+JS code that you can add, which governs how debug_samples from already-finished experiments are visualized; think of adding an overlay of two types of images post factum
Not sure yet, I will get back to you on this later, in 1-2 weeks, thanks.
yes, but the local output was completely empty
that's ok, I think that the race condition will be a non-issue. Thanks for checking!
AgitatedDove14 thanks, that will be helpful!
AgitatedDove14 I do not want to push you in any way, but if you could give me a rough estimate for the SLURM glue code, that would be helpful. I should have a local installation of the trains server to experiment with next week.
so far everything works. The only problem I can think of is a race condition, which I will probably ignore; it happens in the following scenario (see the sketch after the list):
a) a worker finishes its current run, turns into an idle state,
b) my script scrapes the status of the worker, which is idle,
c) a new task is enqueued and picked by the worker,
d) the worker is killed after it managed to pull a task from the queue, so the task will be cancelled as well.
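Roughly, the scale-down check would look like this minimal sketch (get_worker_status is a hypothetical helper for querying the trains server; the scancel call is just to illustrate where the race window sits):

```python
import subprocess

def get_worker_status(worker_id: str) -> str:
    """Hypothetical helper: ask the trains server whether the agent is idle or busy."""
    raise NotImplementedError

def scale_down(worker_id: str, slurm_job_id: str) -> None:
    # (a)/(b) the worker looks idle when its status is scraped
    if get_worker_status(worker_id) == "idle":
        # (c) race window: a new task can be enqueued and pulled by the worker
        #     right here, between the status check and the cancellation below
        # (d) killing the SLURM job also kills the task the worker just pulled
        subprocess.run(["scancel", slurm_job_id], check=True)
```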
thanks, next time I will provide you with all the logs
No, they were not, SuccessfulKoala55
AgitatedDove14 I looked at the K8s glue code, having something similar but for SLURM would be great!
AgitatedDove14 I meant the following scenario:
trains-agents will be running as SLURM jobs (possibly for a very long time). There is a program running on an access node of the cluster (where no computation happens, but from where one can submit jobs to SLURM); this program checks whether there are too few or too many agents running and adjusts the count by cancelling agents or spinning up new ones, roughly as in the sketch below.
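A minimal sketch of that reconciliation loop, assuming the agents are started from a batch script and given a fixed SLURM job name (all names, the target count, and the polling interval are illustrative):

```python
import subprocess
import time

JOB_NAME = "trains-agent"          # illustrative SLURM job name for the agent jobs
AGENT_SCRIPT = "run_agent.sbatch"  # illustrative batch script that launches trains-agent
TARGET_AGENTS = 4                  # desired number of concurrently running agents

def running_agent_jobs():
    """Return the SLURM job ids of the currently running/pending agent jobs."""
    out = subprocess.run(
        ["squeue", "--name", JOB_NAME, "--noheader", "--format", "%i"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def reconcile():
    jobs = running_agent_jobs()
    if len(jobs) < TARGET_AGENTS:
        # too few agents: submit new SLURM jobs until the target is reached
        for _ in range(TARGET_AGENTS - len(jobs)):
            subprocess.run(["sbatch", AGENT_SCRIPT], check=True)
    elif len(jobs) > TARGET_AGENTS:
        # too many agents: cancel the surplus (ideally only idle ones,
        # see the race condition described earlier)
        for job_id in jobs[TARGET_AGENTS:]:
            subprocess.run(["scancel", job_id], check=True)

if __name__ == "__main__":
    while True:
        reconcile()
        time.sleep(60)
```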
I hope you can do this without containers.
Unfortunately there is no docker, there is only singularity. This cluster is used by many users and docker is not secure enough.
Yes, replace the image viewer with a custom widget, but perhaps we can implement this externally (I am not a UI expert, to be honest)
AgitatedDove14 thanks for the additional information:
yes, the report_image problem was resolved after I reordered the dimensions in the tensor. Is there an advantage in using TensorBoard over your reporting? HTML reporting looks powerful, can one inject some JavaScript inside?
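For reference, a minimal sketch of what I have in mind, assuming the installed trains version exposes Logger.report_media for uploading an HTML page (the file name and the overlay script are purely illustrative):

```python
from trains import Logger, Task

task = Task.init(project_name="examples", task_name="html report with js")

# Illustrative HTML page with inline JavaScript, e.g. toggling an overlay image.
html = """<html><body>
  <img id="base" src="base.png">
  <img id="overlay" src="overlay.png" style="position:absolute; left:0; top:0; opacity:0.5">
  <button onclick="toggle()">toggle overlay</button>
  <script>
    function toggle() {
      const o = document.getElementById('overlay');
      o.style.display = (o.style.display === 'none') ? 'block' : 'none';
    }
  </script>
</body></html>"""

with open("overlay_report.html", "w") as f:
    f.write(html)

# Upload the page as a media sample attached to this task.
Logger.current_logger().report_media(
    title="overlay", series="demo", iteration=0, local_path="overlay_report.html"
)
```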
For the images themselves, you can get their URLs
how can I do it?