WackyRabbit7 my apologies for the lack of background in my answer 🙂
Let me start from the top. One of the goals of trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
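To make those steps concrete, here is a rough sketch of the commands such a setup boils down to. This is only an illustration, not the actual trains-agent implementation: the function name `build_setup_commands`, its parameters, the directory names and the order of the steps are all assumptions made for the example.

```python
from typing import List, Optional


def build_setup_commands(
    repo_url: str,
    commit: str,
    requirements_file: str = "requirements.txt",
    diff_file: Optional[str] = None,
    workdir: str = "task_code",
    venv_dir: str = "venv",
) -> List[List[str]]:
    """Return the shell commands (as argument lists) for a hypothetical setup.

    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = [
        # Create an isolated python environment for the experiment.
        ["python3", "-m", "venv", venv_dir],
        # Pull the code and pin it to the commit the experiment was run from.
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", commit],
    ]
    if diff_file:
        # Re-apply the uncommitted changes (pass an absolute path, since
        # `git -C` changes directory before resolving the file).
        commands.append(["git", "-C", workdir, "apply", diff_file])
    # Install the python packages recorded for the experiment.
    commands.append(
        [f"{venv_dir}/bin/pip", "install", "-r", f"{workdir}/{requirements_file}"]
    )
    return commands
```

Each entry could be passed to `subprocess.run(cmd, check=True)` in order; keeping them as data makes it easy to inspect them before anything runs.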
If your entire environment is python based, then virtual-environment mode is probably the easiest starting point.
But sometimes a python environment alone is not enough, for example when switching between CUDA / Nvidia drivers, when you need the nvidia-apex package, which cannot be easily downloaded, or when you need any other software that lives outside the pythonic realm.
In order to achieve this, you can specify a base docker (one that you have already built, with the non-pythonic stuff preinstalled), and then trains-agent will do the entire thing (code / python packages / monitoring) inside the docker. This means the docker acts as a base machine-level setup for the experiment.
When running in docker mode, the default docker is nvidia/cuda: if no docker image is specified on the experiment, the nvidia/cuda docker will be used. A user can also specify any other base docker, for example the horovod docker, or a docker you have built internally with all your non-python code already compiled and installed.
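The fallback rule is simple enough to write down. The snippet below is just a sketch of the selection logic described above, not code taken from trains-agent; `resolve_base_docker` and `DEFAULT_DOCKER_IMAGE` are hypothetical names, and the image names in the usage note are only examples.

```python
from typing import Optional

# Default image named above; used when the experiment does not specify one.
DEFAULT_DOCKER_IMAGE = "nvidia/cuda"


def resolve_base_docker(
    task_docker_image: Optional[str],
    default_image: str = DEFAULT_DOCKER_IMAGE,
) -> str:
    """Pick the base docker for an experiment.

    An image set on the experiment wins; an empty or missing value falls
    back to the default image.
    """
    if task_docker_image and task_docker_image.strip():
        return task_docker_image.strip()
    return default_image
```

So `resolve_base_docker(None)` gives the default, while `resolve_base_docker("horovod/horovod")` keeps whatever the experiment asked for.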
All of this means that selecting a base docker for an experiment execution is just another parameter on the Task, one that can be edited from the UI, like the python packages. It also means that you don't need to build many docker images, and that you can use any docker from docker-hub as the base setup.
Make sense?