thanks for the link advice, will do
I'll let you know if I managed to achieve my goals with StorageManager
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
hmmm, somehow I have a bad feeling about it... Could you check the log, it should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
I added the link just in case anyway 🙂
also, is there any way to install a repo that we clone as a package? we often use absolute imports and do "pip install -e ." to use it
sorry there are so many questions, we just really want to migrate to trains-agent)
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news 🙂
DilapidatedDucks58 trains-agent adds the artifactory URL as --extra-index-url. Are you sure you are getting the correct torch version in the container? The torch html is not an artifactory index, it is a list of links, and I just want to make sure you are getting the correct version, because otherwise it can default to the CPU version, which we don't want 🙂 Anyhow, you can use the direct link in the "installed packages" and just put https://download.pytorch.org/whl/nightly/cu101/torch-1.6.0.dev20200421%2Bcu101-cp36-cp36m-linux_x86_64.whl there instead of torch==
Regarding the precomputed files, yes, if they are already on S3 you can do:
from trains.storage import StorageManager
local_location_of_the_file = StorageManager.get_local_copy('...')
And since there is now caching, and it is persistent over runs (yes, even in containers), you only download the file once 🙂 What do you think? (It also gives you the option of replacing these files, and now you will be able to have the link as one of the Task parameters)
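For reference, a minimal sketch of that pattern (the project, task, parameter name, and S3 path below are hypothetical placeholders):
from trains import Task
from trains.storage import StorageManager

task = Task.init(project_name='my_project', task_name='train')
# expose the file link as an editable Task parameter (hypothetical name / S3 path)
params = task.connect({'precomputed_meta_url': 's3://my-bucket/precomputed/meta.pkl'})
# downloads once; later runs (even inside containers) reuse the persistent cache
local_path = StorageManager.get_local_copy(params['precomputed_meta_url'])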
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to write the TensorBoard logs to this folder? That could lead to "No such file or directory: 'runs'" if one process is deleting it while another is trying to access it, or similar scenarios
the code that is used for training the model is also inside the image
Hi DilapidatedDucks58 just making sure, is the link the pytorch nightly artifactory? Or is it a direct link to the package? Reason for asking: I was not aware they have a proper artifactory... When the task runs, trains-agent will update the installed packages with all the packages it actually used. Could you verify you have the correct version?
Regarding the extra files, you are correct, the docker container is reset every run, so they will get lost. What are those files for? Could you add them to the git repo? You could also pull them from a url with StorageManager and the cache is persistent, so it's very efficient
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are inside the container when it runs
using storage manager is a decent idea, the files are on S3 anyway
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to trains-agent but to the fact that you have 4 processes trying to access the same resource simultaneously)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support 🙂
it also happens sometimes during the run when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific cuda version inside the docker (or on the host machine if used in virtual-env mode). It seems to fail finding the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory you can add it to the trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L54 You can also replace it in the "installed packages" with a direct http link (notice that once you clone the experiment, this section becomes editable). What do you think?
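For illustration, a minimal sketch of such a trains.conf entry, assuming the extra_index_url list under agent.package_manager (the URL is a placeholder; see the linked trains.conf for the exact key and format):
agent {
    package_manager {
        # placeholder private artifactory / extra package index
        extra_index_url: ["https://my-artifactory.example.com/api/pypi/simple"]
    }
}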
I added the link just in case anyway
Smart move :)
DilapidatedDucks58, of course there is 🙂 Actually, with the latest pip 20.1 and the next RC it will be automatically detected and put into "installed packages"
You can treat the "installed packages" just like you would any other "requirements.txt", just add:
git+https://github.com/...
and you are good to go
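For example (the repository URL below is a placeholder), the "installed packages" section could then look like:
git+https://github.com/my-org/my-repo.git
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101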
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
that was tough but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno 2] No such file or directory: 'runs'
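One way to rule out a race on the shared default folder is a sketch like this, which pre-creates a per-process log_dir before constructing the writer (the directory layout here is hypothetical):
import os
from torch.utils.tensorboard import SummaryWriter

# hypothetical per-process log dir so concurrent workers don't collide on the default 'runs' folder
log_dir = os.path.join('tb_logs', 'run_{}'.format(os.getpid()))
os.makedirs(log_dir, exist_ok=True)
writer = SummaryWriter(log_dir=log_dir)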
standalone-mode gives me "Could not freeze installed packages"
A true mystery 🙂
That said, I hardly think it is directly related to the trains-agent
...
Do you have any more insights on when / how it happens ?