Hi WackyRabbit7,
Running in docker mode gives you greater flexibility in terms of environment control, from switching CUDA versions to installing pre-compiled packages (think apt-get), etc. Specifically for DL, if you are using multiple tensorflow versions, they are notorious for being compiled against a specific CUDA version, and the only easy way to switch between them is to use different dockers. If you are a PyTorch user, then you are in luck: every pytorch version is compiled against several CUDA versions, and trains-agent will pick the correct one based on the CUDA installed on the machine (which means you can safely use virtual-environment mode). Lastly, switching from docker mode to virtual-environment mode is quite easy, you basically rerun the agent with a different parameter, so you can always start with whatever is easier for you to set up and only switch later 🙂
So in theory, no problem, but how would you make sure the third agent will not pull jobs while the first two are running? Even though in theory you can have multiple processes sharing GPU resources, it usually fails on memory allocation (the sum of the memory allocated across all processes cannot exceed the GPU's physical memory)... I mean, you can check whether the machine is already doing something before you enqueue a job into the second queue... but this seems quite fragile to maintain.
If the machine has no GPU, the agent will automatically switch to cpu-only. You can verify this by checking the runtime trains-agent configuration (printed to the console when it starts), look for:
agent.cuda_version = 0
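If you want to check this from a script rather than by eye, here is a minimal sketch. It assumes you have captured the agent's startup output as text; the helper names `parse_cuda_version` and `is_cpu_only` are made up for illustration and are not part of trains-agent, and the `101` value in the usage line is only a hypothetical example of a non-zero setting.

```python
import re

def parse_cuda_version(config_text):
    """Return the value printed for agent.cuda_version, or None if the line is absent."""
    match = re.search(
        r"^\s*agent\.cuda_version\s*=\s*(\S+)\s*$", config_text, re.MULTILINE
    )
    return match.group(1) if match else None

def is_cpu_only(config_text):
    """True when the printed configuration reports agent.cuda_version = 0."""
    return parse_cuda_version(config_text) == "0"

# The line quoted above means the agent fell back to cpu-only.
print(is_cpu_only("agent.cuda_version = 0"))
# A hypothetical non-zero value would mean a CUDA version was detected.
print(is_cpu_only("agent.cuda_version = 101"))
```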
I must say this whole framework is really DX aware (developer experience)...
Thank you! This is exactly what we are aiming for, and hearing from our community that we managed to convey this approach is truly important for us!
So I assume, trains assumes I have nvidia-docker installed on the agent machine?
docker + nvidia-docker-runtime are assumed to be installed
The nvidia/cuda docker image is pulled when requested (like any other container image)
Moreover, since I'm going to use
Task.execute_remotely (and not the UI), is there any way to specify the docker image in code?
task.set_base_docker(docker_cmd='nvidia/cuda -v /mnt:/tmp')
Notice that you can pass not only the docker image but also docker execution parameters such as volume mounts, environment variables, etc.
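To make the "image plus execution parameters" string less error-prone, here is a rough sketch of how it could be assembled. `build_docker_cmd` is a hypothetical helper, not something trains provides; it just joins the image name with standard `docker run` flags (`-v` for volumes, `-e` for environment variables). The commented-out part shows where the result would go, assuming trains is installed and a server is configured; the project, task, and queue names there are placeholders.

```python
def build_docker_cmd(image, volumes=None, env=None):
    """Join an image name and docker run arguments into a single string."""
    parts = [image]
    # Each volume becomes "-v host_path:container_path".
    for host_path, container_path in (volumes or {}).items():
        parts += ["-v", f"{host_path}:{container_path}"]
    # Each environment variable becomes "-e NAME=value".
    for name, value in (env or {}).items():
        parts += ["-e", f"{name}={value}"]
    return " ".join(parts)

docker_cmd = build_docker_cmd("nvidia/cuda", volumes={"/mnt": "/tmp"})
print(docker_cmd)  # nvidia/cuda -v /mnt:/tmp

# Usage with trains (placeholder names, needs a configured server):
# from trains import Task
# task = Task.init(project_name="examples", task_name="remote run")
# task.set_base_docker(docker_cmd=docker_cmd)
# task.execute_remotely(queue_name="default")
```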
So regarding 1, I'm not really sure what the difference is
When running in docker mode, what is different from the regular mode? Nowhere in the instructions is nvidia-docker listed as a prerequisite, so how exactly will tasks on GPU get executed?
I feel I don't understand enough of the mechanism to (1) understand the difference between docker mode and non-docker mode and (2) know what the use case is for each
WackyRabbit7 my apologies for the lack of background in my answer 🙂
Let me start from the top: one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is probably the easiest starting point.
But sometimes a python environment alone is not enough, for example when switching between CUDA / Nvidia drivers, or when you need the nvidia-apex package, which cannot be easily downloaded, or any other software that lives outside the pythonic realm.
To handle this, you can specify a base docker (one that you have already built, with the non-pythonic stuff preinstalled), and then trains-agent will do the entire thing (code / python packages / monitoring) inside the docker. This means the docker acts as a base machine-level setup for the experiment.
When running in docker mode, the default docker is
nvidia/cuda, which is used whenever no docker image is specified on the experiment. A user can also specify any other base docker, for example the
horovod docker or a docker you have built internally with all your non-python code already compiled and installed.
This all means that selecting a base docker for an experiment execution is just another parameter on the Task that can be edited from the UI, like the python packages. This also means that you don't need to build many docker images, and that you can use any docker from docker-hub as the base setup.
Make sense?