as a sidenote, I am not able to pull the newest release, looks like it's not pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
I assume, temporary fix is to switch to trains-server?
thanks! this bug and the cloning problem seem to be fixed
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
okay, so if there's no workaround atm, should I create a GitHub issue?
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
maybe I should use explicit reporting instead of Tensorboard
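for reference, this is roughly what I mean by explicit reporting (just a sketch with the trains SDK; the project/metric names and the dummy loss value are placeholders):
```
# sketch of explicit reporting with the trains SDK
# (project/metric names and the dummy loss value are placeholders)
from trains import Task

task = Task.init(project_name="my_project", task_name="explicit_reporting_test")
logger = task.get_logger()

for epoch in range(10):
    val_loss = 1.0 / (epoch + 1)  # placeholder standing in for the real metric
    # report against the epoch my training loop is actually on
    logger.report_scalar(title="val", series="loss", value=val_loss, iteration=epoch)
```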
still no luck, I tried everything =( any updates?
don't know if it's relevant, but I also added a new user to apiserver.conf today
I've already pulled the new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
I don't think so, because the max value of each metric is calculated independently of the other metrics
so the max values that I get can be reached at different epochs
just DMed you a screenshot where you can see a part of the token
in order to use private repositories for our experiments, I add the agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository ' https://username:token@github.com/our_organization/non_existing_repo.git/ ' not found
exposing the real token
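for context, this is roughly what that part of the agent's clearml.conf looks like (placeholder values, obviously):
```
# relevant part of clearml.conf on the agent machine (placeholder values)
agent {
    git_user: "username"
    git_pass: "personal-access-token"
}
```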
does this mean that setting initial iteration to 0 should help?
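this is what I'd try (just a sketch; continue_last_task is my guess at how the run is being resumed, and I'm assuming set_initial_iteration(0) is the knob that drops the stored offset):
```
# sketch: continue a run but keep reporting epochs as-is
# (continue_last_task is an assumption about how the run is resumed;
#  set_initial_iteration(0) is assumed to drop the stored offset)
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="continued_training",
    continue_last_task=True,
)
task.set_initial_iteration(0)  # report iterations without the previous offset
```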
yes, this is the use case, I think we can use smth like Redis for this communication
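something like this is what I had in mind (rough sketch; the local Redis instance, queue name and payload shape are all placeholders):
```
# rough sketch of the Redis hand-off
# (local Redis, queue name and payload shape are placeholders)
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# producer side: push a message describing the finished task
r.lpush("experiment_events", json.dumps({"task_id": "abc123", "status": "done"}))

# consumer side: block until a message arrives (timeout in seconds)
item = r.brpop("experiment_events", timeout=30)
if item is not None:
    _queue_name, payload = item
    print(json.loads(payload))
```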
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, it gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment id)
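for reference, here's roughly how the explicit naming can be set from code (a sketch only; the single-string set_base_docker form and the image tag are assumptions on my part, and this has to run before the task is enqueued):
```
# sketch: pin the container name per task via the docker arguments
# (single-string set_base_docker form and the image tag are assumptions)
from clearml import Task

task = Task.init(project_name="my_project", task_name="named_container_run")
# reuse the task id as the container name so it's easy to find on the server
task.set_base_docker(
    "nvidia/cuda:10.1-runtime-ubuntu18.04 --name clearml_{}".format(task.id)
)
```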
yeah, that sounds right! thanks, will try
for me, increasing shm-size usually helps. what does this RC fix?
I updated the version in the Installed packages section before starting the experiment
LOL
wow
I was trying to find how to create a queue using the CLI
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
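one workaround I'm considering (just my own idea, not something confirmed in this thread) is to register file.pkl as an artifact once and fetch a local copy at runtime, instead of reading it from the cloned repo folder:
```
# workaround idea (not confirmed in this thread): register file.pkl as an
# artifact once, then fetch a local copy at runtime instead of reading it
# from the cloned repo folder

# --- step 1: run once from a machine that has file.pkl locally ---
from trains import Task

setup_task = Task.init(project_name="my_project", task_name="register_extra_data")
setup_task.upload_artifact(name="extra_data", artifact_object="file.pkl")
setup_task.close()

# --- step 2: inside the training code ---
data_task = Task.get_task(project_name="my_project", task_name="register_extra_data")
local_path = data_task.artifacts["extra_data"].get_local_copy()
print("file.pkl available at", local_path)
```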
it also happens sometimes during the run, when Tensorboard is trying to write smth to disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen, since I'm running four separate workers