the weird part is that the old job continues running when I recreate the worker and enqueue the new job
example of the failed experiment
well okay, it's probably not that weird considering that the worker just runs the container
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
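roughly what I mean, as a minimal sketch (the project/task names, the scalar tag, and the loop are placeholders, not my actual code):
from trains import Task
from torch.utils.tensorboard import SummaryWriter

# resume script: offset reporting to the task's last recorded iteration
task = Task.init(project_name="my_project", task_name="my_training")  # placeholder names
task.set_initial_iteration(task.get_last_iteration())

# training code: the epoch number is passed explicitly as the global step
writer = SummaryWriter()
for epoch in range(10):  # stand-in for the real training loop
    loss = 1.0 / (epoch + 1)  # stand-in for the real loss value
    writer.add_scalar("train/loss", loss, global_step=epoch)
writer.close()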
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
LOL
wow 😃
I was trying to find how to create a queue using the CLI 😃
perhaps it’s happening because it’s an old project that was moved to the new root project?
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to the values that are explicitly reported, no?
okay, so if there’s no workaround atm, should I create a GitHub issue?
I've already pulled the new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
not necessarily, there are rare cases when the container keeps running after the experiment is stopped or aborted
will do!
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
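for reference, the trains.conf change was roughly along these lines (the exact key, extra_index_url under agent.package_manager, is my assumption, so treat this as a sketch):
agent {
    package_manager {
        # extra package source so the agent can resolve the nightly cu101 wheels
        # (key name is an assumption; the URL is the one mentioned above)
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}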
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
I decided to restart the containers one more time, this is what I got.
I had to restart the Docker service to remove the containers
hmmm allegroai/trains:latest whatever it is
standalone-mode gives me "Could not freeze installed packages"
I added the link just in case anyway 😃
also, is there any way to install the repo that we clone as a package? we often use absolute imports and do "pip install -e ." to use it
sorry there are so many questions, we just really want to migrate to trains-agent)
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
on a side note, is there any way to automatically give more meaningful names to the running Docker containers?
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
it prints an empty dict
I’m doing Task.init() in the script, maybe it somehow resets connected parameters… but it used to work before, weird
ValueError: Task has no hyperparams section defined
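roughly the pattern I'm talking about, as a sketch (project/task names and the hyperparameter values are placeholders):
from trains import Task

task = Task.init(project_name="my_project", task_name="my_training")  # placeholder names
params = {"learning_rate": 0.001, "batch_size": 32}  # placeholder hyperparameters
params = task.connect(params)  # this is where the parameters get connected to the task

print(task.get_parameters())  # this is the call that prints an empty dict for me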