Answered
Question About The Usage Of Trains Agents

Question about the usage of Trains agents.
In our company we have 3 HPC servers; two of them have multiple GPUs, one is CPU only.

I saw in the docs that multiple agents can be run separately, assigning GPUs in whatever manner you want.

My questions are:

  1. What are the differences between running the agent daemons in docker mode vs. not? What should I consider when choosing?
  2. Assuming a machine has 2 GPUs, will it cause any problems if we have 3 running agents on it, one using GPU1, one using GPU2, and a third one assigned to be able to use both? (We will run a separate queue for the dual-GPU agent.)
  3. On the CPU-only agent - should I specify --cpu-only, or will it figure out on its own that there are no GPUs available?

Lastly, I must say this whole framework is really DX aware (developer experience, similar to UX but for developers), and I truly appreciate this kind of engineering. The only other tool I've come across with a data-scientist-focused DX is Netflix's Metaflow. Great job!

  
  
Posted 4 years ago

Answers 6


So I assume, trains assumes I have nvidia-docker installed on the agent machine?

docker + the nvidia-docker runtime are assumed to be installed
the nvidia/cuda docker image is pulled when requested (like any other container image)

Moreover, since I'm going to use Task.execute_remotely (and not through the UI) is there any code way to specify the docker image to be used?

Sure, task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')
Notice that you can not only pass the docker image, but also provide docker execution parameters such as volume mounts or environment variables.
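
For example, a minimal sketch of that flow (the project, task, and queue names here are placeholders, not from this thread):

from trains import Task

# Register the script as a Task (placeholder project/task names)
task = Task.init(project_name='examples', task_name='remote docker run')

# Base docker image plus any docker arguments (volume mounts, environment variables, ...)
task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')

# Stop executing locally and enqueue the task for a trains-agent running in docker mode
task.execute_remotely(queue_name='default', exit_process=True)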

  
  
Posted 4 years ago

WackyRabbit7 my apologies for the lack of background in my answer 🙂
Let me start from the top. One of the goals of the trains-agent is to reproduce the "original" execution environment; once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is probably the easiest starting point.

But sometimes a python environment alone is not enough, for example when switching between CUDA / Nvidia driver versions, when using the nvidia-apex package (which cannot be easily downloaded), or for any other software you need to install that is outside the pythonic realm.

In order to achieve this, you can specify a base docker (one that you have already prepared, with the non-pythonic stuff preinstalled), and then trains-agent will do the entire thing (code / python packages / monitoring) inside that docker. This means the docker acts as a machine-level base setup for the experiment.

When running in docker mode, the default docker is nvidia/cuda, so if no docker image is specified on the experiment, the nvidia/cuda image will be used. A user can also specify any other base docker, for example the horovod docker or a docker you have built internally with all your non-python code already compiled and installed.
This all means that selecting a base docker for an experiment execution is just another parameter on the Task, which can be edited from the UI, just like the python packages. It also means you don't need to build many docker images; you can use any docker image from Docker Hub as the base setup.
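
For reference, a rough sketch of the two agent modes (the queue name is a placeholder, and the exact flags should be double-checked against trains-agent daemon --help for your trains-agent version):

# virtual-environment mode: the python environment is created in a venv on the host
trains-agent daemon --queue default
# docker mode: the same environment setup happens inside a container, with nvidia/cuda as the default base image
trains-agent daemon --queue default --docker nvidia/cuda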

Make sense ?

  
  
Posted 4 years ago

Makes sense

So I assume, trains assumes I have nvidia-docker installed on the agent machine?

Moreover, since I'm going to use Task.execute_remotely (and not through the UI) is there any code way to specify the docker image to be used?

  
  
Posted 4 years ago

Very nice thanks, I'm going to try the SA server + agents setup this week, let's see how it goes ✌

  
  
Posted 4 years ago

Hi WackyRabbit7 ,
  1. Running in docker mode gives you greater flexibility in terms of environment control, from switching CUDA versions to pre-compiled non-python packages you may need (think apt-get), etc. Specifically for DL, if you are using multiple TensorFlow versions, they are notorious for being compiled against a specific CUDA version, and the only easy way to switch between them is different dockers. If you are a PyTorch user you are in luck: all PyTorch versions are shipped compiled against different CUDA versions, and trains-agent will pick the correct one based on the CUDA installed on the machine (which means you can safely use virtual-environment mode). Lastly, switching between docker mode and virtual-environment mode is quite easy - basically rerun the agent with a different parameter - so you can always start with whatever is easier for you to set up and switch later 🙂
  2. In theory, no problem, but how would you make sure the third agent does not pull jobs while the first two are running? Even though in theory you can have multiple processes sharing GPU resources, it usually fails on memory allocation (the sum of the memory allocated across all processes cannot exceed the hardware RAM limit)... You could check, before enqueuing a job into the second queue, whether the machine is already doing something, but that seems quite fragile to maintain.
  3. If the machine has no GPU it will automatically switch to cpu-only. You can verify this by checking the runtime trains-agent configuration (printed to the console when it starts); look for: agent.cuda_version = 0
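
As a concrete sketch of the setup from questions 2 and 3 (queue names are placeholders, the queues are assumed to already exist on the server, and the exact flags should be verified against trains-agent daemon --help):

# dual-GPU machine: one agent per GPU, plus a third agent using both GPUs on its own queue
trains-agent daemon --queue single_gpu --gpus 0
trains-agent daemon --queue single_gpu --gpus 1
trains-agent daemon --queue dual_gpu --gpus 0,1
# CPU-only machine
trains-agent daemon --queue cpu_queue --cpu-only

Keep in mind the memory caveat above: jobs pulled from the dual_gpu queue will compete for GPU memory with whatever the single-GPU agents are already running.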

I must say this whole framework is really DX aware (developer experience)...

Thank you! This is exactly what we are aiming for, and hearing from our community that we managed to convey this approach is truly important for us!

  
  
Posted 4 years ago

So regarding 1, I'm not really sure what the difference is.

When running in docker mode, what is different from the regular mode? Nowhere in the instructions is nvidia-docker listed as a prerequisite, so how exactly will tasks on GPU get executed?

I feel I don't understand enough of the mechanism to know (1) the difference between docker mode and not, and (2) what the use case for each is.

  
  
Posted 4 years ago