Question about the usage of Trains agents. In our company we have 3 HPC servers; two of them have multiple GPUs, one is CPU-only. I saw in the docs that multiple agents can be run separately, assigning GPUs in whatever manner you want. My questions are 1


Hi WackyRabbit7,

Running in Docker mode gives you greater flexibility in terms of environment control, from switching CUDA versions to pre-compiled system packages you may need (think apt-get), etc. Specifically for DL: if you are using multiple TensorFlow versions, they are notorious for being compiled against a specific CUDA version, and the only easy way to switch between them is different dockers. If you are a PyTorch user, then you are in luck: they ship every PyTorch version compiled against different CUDA versions, and trains-agent will pick the correct one based on the CUDA installed on the machine (which means you can safely use virtual environment mode). Lastly, switching from docker mode to virtual-environment mode is quite easy, basically rerunning the agent with a different parameter, so you can always start with whatever is easier for you to set up and switch later 🙂

So in theory, no problem, but how would you make sure the third agent does not pull jobs while the first two are running? Even though in theory multiple processes can share GPU resources, it usually fails on memory allocation (the sum of the memory allocated across all processes cannot exceed the hardware RAM limitation)... I mean, you could check before enqueuing a job into the second queue whether the machine is already doing something, but this seems quite fragile to maintain.

If the machine has no GPU it will automatically switch to cpu-only. You can verify it by checking the runtime trains-agent configuration (printed to the console when it starts); look for: agent.cuda_version = 0
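For reference, a minimal sketch of how the agents could be launched, one per resource pool. The queue names and the docker image are placeholders for the example; the flags (--queue, --gpus, --docker, --cpu-only) are the standard trains-agent daemon options:

# GPU machine 1 - docker mode, two agents, one GPU each
trains-agent daemon --queue gpu_queue --gpus 0 --docker nvidia/cuda
trains-agent daemon --queue gpu_queue --gpus 1 --docker nvidia/cuda

# GPU machine 2 - virtual-environment mode (no --docker flag), both GPUs for one agent
trains-agent daemon --queue gpu_queue --gpus 0,1

# CPU-only machine - pulls from its own queue
trains-agent daemon --queue cpu_queue --cpu-only

Experiments enqueued to gpu_queue or cpu_queue will only be pulled by agents listening on that queue, which is the usual way to keep CPU-only work off the GPU boxes and vice versa.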

I must say, this whole framework is really DX-aware (developer experience)...

Thank you! This is exactly what we are aiming for, and hearing from our community that we managed to convey this approach is truly important for us!

  
  