wow
I was trying to find out how to create a queue using the CLI
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
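one possible workaround, as a rough sketch only: keep the file in a folder outside the path the agent clones into, and let the code look it up through an environment variable, falling back to the repo folder. `EXTRA_DATA_DIR` and `find_extra_file` are made-up names for illustration, not something trains provides:

```python
import os
from pathlib import Path

# EXTRA_DATA_DIR is a hypothetical variable name; it would be set in the Docker
# image (e.g. ENV EXTRA_DATA_DIR=/opt/extra_data) next to the ADD instruction.
ENV_VAR = "EXTRA_DATA_DIR"


def find_extra_file(name, repo_root="."):
    """Return the path to `name`, preferring the directory baked into the image."""
    candidates = []
    baked_dir = os.environ.get(ENV_VAR)
    if baked_dir:
        candidates.append(Path(baked_dir) / name)
    # Fall back to the original location inside the repository checkout.
    candidates.append(Path(repo_root) / "extra_data" / name)
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(f"{name} not found in: {[str(p) for p in candidates]}")
```

with this, the Docker layer would ADD file.pkl to a folder outside the task repository, so a fresh clone would not touch it.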
it also happens sometimes during the run when tensorboard is trying to write smth to the disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
that was tough but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
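I don't know the actual cause (the error above is cut off), but if the workers are colliding on a shared log folder, one thing to try is giving every run its own directory. a minimal sketch, where `make_run_log_dir` and the `runs` base folder are my own assumptions:

```python
import os
import uuid
from datetime import datetime


def make_run_log_dir(base_dir="runs"):
    """Create and return a log directory that no other worker will share."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # pid + random suffix keep names distinct even when runs start in the same second
    run_name = f"{stamp}-pid{os.getpid()}-{uuid.uuid4().hex[:8]}"
    log_dir = os.path.join(base_dir, run_name)
    # exist_ok=True tolerates another process creating the parent folder meanwhile
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```

the result can then be passed explicitly, e.g. `writer = SummaryWriter(log_dir=make_run_log_dir())`, instead of relying on the default folder.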
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
wow, thanks, just updated our server!
can't seem to find these metrics snapshot plots =) how do I plot one?
not necessarily, command usually stays the same irrespective of the machine
right now we can pass GitHub secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
maybe I should use explicit reporting instead of Tensorboard
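roughly what I have in mind for explicit reporting, as a sketch: a small helper that pushes a dict of metrics through a logger's `report_scalar(title, series, value, iteration)` call, which is how I understand the clearml `Logger` to work. `report_metrics` and the `"train"` title are my own placeholder names:

```python
def report_metrics(logger, metrics, iteration, title="train"):
    """Send each metric as a scalar through the logger's report_scalar method."""
    for name, value in metrics.items():
        # One series per metric name, all grouped under the same plot title.
        logger.report_scalar(
            title=title,
            series=name,
            value=float(value),
            iteration=iteration,
        )
```

inside a task it would be called with something like `report_metrics(Logger.current_logger(), {"loss": 0.42}, iteration=10)`, and no Tensorboard event files get written to disk along the way.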
I guess I could manually explore different containers and their content. as far as I remember, I had to update Elastic records when we moved to the new cloud provider in order to update model URLs
sorry, my bad, after some manipulations I made it work. I had to manually change HTTP to HTTPS in the config file for the Web and Files (not API) servers after initialization, but besides that it works
yeah, I am aware of trains-agent, we are planning to start using it soon, but still, copying original training command would be useful
new icons are slick, it would be even better if you could upload custom icons for the different projects
not quite. for example, I'm not sure which info is stored in Elastic and which is in MongoDB
1 - yes, of course =) but it would be awesome if you could customize the content - to include key metrics and hyperparameters, for example
3 - hooooooraaaay
I'm so happy to see that this problem has been finally solved!