that was tough, but I finally managed to get it working! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
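One possible workaround, sketched under an assumption: if the error comes from several runs starting at the same moment and racing on the default `runs/` log directory, giving every process its own pre-created directory should avoid the clash. The cause is not confirmed anywhere in this thread, and `make_run_log_dir` is a hypothetical helper, not part of trains or PyTorch.

```python
import os
import socket
import uuid
from datetime import datetime


def make_run_log_dir(base_dir="runs"):
    """Create and return a log directory that is unique to this process.

    Hypothetical helper: assumes the failure is a race between runs that
    start simultaneously and share one log root.
    """
    stamp = datetime.now().strftime("%b%d_%H-%M-%S")
    # Hostname, PID and a short random suffix keep concurrent runs apart,
    # even when they start within the same second on the same machine.
    name = f"{stamp}_{socket.gethostname()}_{os.getpid()}_{uuid.uuid4().hex[:8]}"
    path = os.path.join(base_dir, name)
    # exist_ok=True tolerates another process creating base_dir first.
    os.makedirs(path, exist_ok=True)
    return path
```

The result can then be passed explicitly, e.g. `writer = SummaryWriter(log_dir=make_run_log_dir())`, instead of relying on the default directory name.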
1 - yes, of course =) but it would be awesome if you could customize the content - to include key metrics and hyperparameters, for example
3 - hooooooraaaay
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
wow, thanks, just updated our server!
can't seem to find these metrics snapshot plots =) how do I plot one?
not necessarily, command usually stays the same irrespective of the machine
it will probably screw up my resource monitoring plots, but well, who cares 😃
perhaps it’s happening because it’s an old project that was moved to the new root project?
I'll get back to you with the logs when the problem occurs again
I change the arguments in the Web UI, but it looks like they are not parsed by trains
same here, changing arguments in the Args section of Hyperparameters doesn’t work; the training script starts with the default values.
trains 0.16.0
trains-agent 0.16.0
trains-server 0.16.0
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
copy-pasting entire training command into command line 😃
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news 👍
standalone-mode gives me "Could not freeze installed packages"
nice! exactly what I need, thank you!
it prints an empty dict
I’m doing Task.init() in the script, maybe it somehow resets connected parameters… but it used to work before, weird
ValueError: Task has no hyperparams section defined
nope, same problem even after creating a new experiment from scratch
I added the link just in case anyway 😃
also, is there any way to install a repo that we clone as a package? we often use absolute imports and do "pip install -e ." to utilize it
sorry there are so many questions, we just really want to migrate to trains-agent)
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
I decided to restart the containers one more time, this is what I got.
I had to restart Docker service to remove the containers
not necessarily, there are rare cases when a container keeps running after the experiment is stopped or aborted
will do!
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
hmmm, allegroai/trains:latest, whatever that is
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
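For illustration only: if the containers were launched by hand, `docker run --name <name>` would set the name, and a readable one could be built from experiment metadata. Whether trains-agent exposes a setting for this is not stated in this thread. The helper and its parameters (`project`, `task_id`, `gpu_index`) are hypothetical; the sanitising step assumes Docker's rule that names contain only `[a-zA-Z0-9_.-]` and start with an alphanumeric character.

```python
import re


def container_name(project, task_id, gpu_index):
    """Build a readable, Docker-safe container name (hypothetical helper)."""
    raw = f"{project}-gpu{gpu_index}-{task_id[:8]}"
    # Replace every character Docker would reject with an underscore.
    name = re.sub(r"[^a-zA-Z0-9_.-]", "_", raw)
    # The first character must be alphanumeric; prefix one if it is not.
    if not re.match(r"[a-zA-Z0-9]", name):
        name = "c" + name
    return name
```

A name built this way would be passed as `docker run --name <name> ...`, so `docker ps` shows which experiment and GPU each container belongs to.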