Answered

Hey guys, I'm trying to run an experiment using trains-agent. I have a custom Docker image with nightly versions of PyTorch and our own library installed from a private repo. I assumed these packages would be automatically available to trains during the experiment run, but I get the following error when I try to enqueue my experiment. Is there any way to use the libraries installed in the image instead of reinstalling them from scratch?

  				
Posted 
	5 years ago

DilapidatedParrot58

Answers 20

Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific CUDA version inside the docker (or on the host machine, if used in virtual-env mode). It seems to fail to find the specific version "torch==1.6.0.dev20200421+cu101".
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory, you can add it to the trains.conf: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L54 You can also replace it in the "installed packages" with a direct http link (notice that once you clone the experiment, this section becomes editable). What do you think?

  				
Posted 
	5 years ago

AgitatedDove14

Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the workers share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent, but to the fact that you have 4 processes trying to access the same resource simultaneously).

BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support 🙂

  				
Posted 
	5 years ago

AgitatedDove14

I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data

  				
Posted 
	5 years ago

DilapidatedParrot58

no, I even added the argument to specify tensorboard log_dir to make sure this is not happening

  				
Posted 
	5 years ago

DilapidatedParrot58

Hmmm, somehow I have a bad feeling about it... Could you check the log? It should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?

  				
Posted 
	5 years ago

AgitatedDove14

A true mystery 🙂
That said, I hardly think it is directly related to the trains-agent ...
Do you have any more insights on when / how it happens ?

  				
Posted 
	5 years ago

AgitatedDove14

I added the link just in case anyway 😃

also, is there any way to install a repo that we clone as a package? we often use absolute imports and do "pip install -e ." to make use of it
sorry there are so many questions, we just really want to migrate to trains-agent)

  				
Posted 
	5 years ago

DilapidatedParrot58

it also happens sometimes during the run, when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen, since I'm running four separate workers

  				
Posted 
	5 years ago

DilapidatedParrot58

DilapidatedDucks58 trains-agent adds the artifactory URL as --extra-index-url. Are you sure you are getting the correct torch version in the container? The torch html is not an artifactory html, it is a list of links, so I just want to make sure you are getting the correct version; otherwise it can default to the CPU version, which we don't want 🙂 Anyhow, you can use the direct link in the "installed packages" section: just put "https://download.pytorch.org/whl/nightly/cu101/torch-1.6.0.dev20200421%2Bcu101-cp36-cp36m-linux_x86_64.whl" there instead of torch==
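
For illustration, the direct link mentioned above can be derived from the version string shown in "installed packages". This is only a sketch and not part of trains: the function name, the base URL constant and the default tags are assumptions, chosen to match the cp36 Linux wheel quoted in this thread. The one non-obvious step is that the "+" in the local version label has to be percent-encoded as %2B.

```python
from urllib.parse import quote

# Assumption: the nightly cu101 wheel directory referenced in this thread.
NIGHTLY_BASE = "https://download.pytorch.org/whl/nightly/cu101/"


def nightly_wheel_url(version: str,
                      py_tag: str = "cp36",
                      abi_tag: str = "cp36m",
                      platform: str = "linux_x86_64") -> str:
    """Build a direct wheel URL for a torch nightly version string.

    The tag defaults are hypothetical and only match Python 3.6 on Linux.
    """
    filename = f"torch-{version}-{py_tag}-{abi_tag}-{platform}.whl"
    # quote() leaves letters, digits and "_.-~" alone and encodes "+" as %2B.
    return NIGHTLY_BASE + quote(filename)


if __name__ == "__main__":
    print(nightly_wheel_url("1.6.0.dev20200421+cu101"))
```

The resulting URL can be pasted into "installed packages" in place of the torch== line; whether a wheel with those exact tags exists for a given nightly still has to be checked against the index page.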

Regarding the precomputed files, yes, if they are already on S3 you can do:
from trains.storage import StorageManager
local_location_of_the_file = StorageManager.get_local_copy(' ')
And since there is now caching, and it is persistent over runs (yes, even in containers), you only download the file once 🙂 What do you think? (It also gives you the option of replacing these files, and you will now be able to have the link as one of the Task parameters.)
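
A slightly fuller sketch of the same idea, for anyone who wants to wrap the call. The helper name, the injectable fetcher argument and the S3 path in the usage comment are all hypothetical; only StorageManager.get_local_copy comes from the snippet above. It also assumes that a falsy return value means the download failed.

```python
from pathlib import Path
from typing import Callable, Optional


def fetch_extra_file(remote_url: str,
                     get_local_copy: Optional[Callable[[str], Optional[str]]] = None) -> Path:
    """Return the local path of a cached copy of remote_url.

    get_local_copy defaults to trains' StorageManager.get_local_copy; it can
    be replaced by any callable with the same shape, e.g. for testing.
    """
    if get_local_copy is None:
        # Imported lazily so the helper can be exercised without trains installed.
        from trains.storage import StorageManager
        get_local_copy = StorageManager.get_local_copy
    local_path = get_local_copy(remote_url)
    if not local_path:
        # Assumption: a falsy result signals that the file could not be fetched.
        raise FileNotFoundError(f"could not fetch {remote_url}")
    return Path(local_path)


# Hypothetical usage inside the training script (bucket and key are made up):
# meta_path = fetch_extra_file("s3://my-bucket/extra_data/file.pkl")
```

Keeping the URL as a function argument makes it easy to expose it as one of the Task parameters later.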

  				
Posted 
	5 years ago

AgitatedDove14

> docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.

Any chance they try to store the TensorBoard logs in this folder? This could lead to "No such file or directory: 'runs'" if one process is deleting it while another is trying to access it, or similar scenarios.
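
One way to rule this scenario out is to give every process its own log directory instead of relying on the default relative 'runs' folder. The sketch below is an assumption about how that could look, not something trains does for you; the helper name and the /tmp/tb_logs path are made up for illustration.

```python
import os
import uuid
from pathlib import Path


def make_unique_log_dir(base_dir: str) -> Path:
    """Create and return a log directory that no other worker shares."""
    base = Path(base_dir)
    # exist_ok=True makes this safe if another process creates the base first.
    base.mkdir(parents=True, exist_ok=True)
    # The process id plus a random suffix keeps concurrent runs apart.
    run_dir = base / f"run-{os.getpid()}-{uuid.uuid4().hex[:8]}"
    run_dir.mkdir()
    return run_dir


# Hypothetical usage in the training script (requires torch and tensorboard):
# from torch.utils.tensorboard import SummaryWriter
# writer = SummaryWriter(log_dir=str(make_unique_log_dir("/tmp/tb_logs")))
```

Using an absolute base path also removes any dependence on the working directory the process happens to start in.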

  				
Posted 
	5 years ago

AgitatedDove14

thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager

  				
Posted 
	5 years ago

DilapidatedParrot58

standalone-mode gives me "Could not freeze installed packages"

  				
Posted 
	5 years ago

DilapidatedParrot58

great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working

I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/

but trains probably rewrites the folder when cloning the repo. is there any workaround?

  				
Posted 
	5 years ago

DilapidatedParrot58

that was tough but I finally managed to get it working! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)

the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno 2] No such file or directory: 'runs'

  				
Posted 
	5 years ago

DilapidatedParrot58

weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101

  				
Posted 
	5 years ago

DilapidatedParrot58

> I added the link just in case anyway

Smart move :)

DilapidatedDucks58, of course there is 🙂 Actually, with the latest pip 20.1 and the next RC, it will be automatically detected and put into "installed packages".

You can treat the "installed packages" just like you would any other "requirements.txt", just add:
git+https://github.com/ ... and you are good to go

  				
Posted 
	5 years ago

AgitatedDove14

the code that is used for training the model is also inside the image

  				
Posted 
	5 years ago

DilapidatedParrot58

docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.

awesome news 👍

  				
Posted 
	5 years ago

DilapidatedParrot58

Hi DilapidatedDucks58, just making sure: is the link a PyTorch nightly artifactory, or is it a direct link to the package? The reason I'm asking is that I was not aware they have a proper artifactory... When the task runs, the trains agent will update the installed packages with all the packages it actually installed. Could you verify you have the correct version?
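
A quick way to do that check is to look at the local version label, i.e. the part after the "+" in strings such as torch==1.6.0.dev20200421+cu101. The helper below is a hypothetical sketch and only parses the string; note that a missing label does not by itself prove a CPU build.

```python
from typing import Optional


def local_build_label(version: str) -> Optional[str]:
    """Return the label after '+', e.g. 'cu101' or 'cpu', or None if absent."""
    _, sep, label = version.partition("+")
    return label if sep and label else None


# Hypothetical usage inside the container (requires torch):
# import torch
# print(torch.__version__, local_build_label(torch.__version__), torch.version.cuda)
```

Running the commented lines inside the container and comparing the output with the "installed packages" section of the finished task shows whether the agent picked up the CUDA build.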

Regarding the extra files, you are correct, the docker container is reset every run, so they will get lost. What are those files for? Could you add them to the git repo? You could also pull them from a url with StorageManager and the cache is persistent, so it's very efficient

  				
Posted 
	5 years ago

AgitatedDove14

this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are inside the container when it runs
using storage manager is a decent idea, the files are on S3 anyway

  				
Posted 
	5 years ago

DilapidatedParrot58
