Hi SarcasticSparrow10
You will need to have multiple trains-agents, but they will be sharing the same queue (i.e. pulling jobs from the same queue the HPO process is pushing to)
Make sense?
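For example, the HPO side might look roughly like this (a minimal sketch with the trains SDK; the task ID, queue name, and parameter name are all made up):
```python
from trains import Task

# The HPO process clones a base experiment per parameter set and pushes
# each clone into a shared queue; every trains-agent listening on that
# queue will pick up and execute whatever lands in it.
base_task = Task.get_task(task_id='1234abcd')  # hypothetical base experiment ID
clone = Task.clone(source_task=base_task, name='hpo trial 0')
clone.set_parameters({'General/learning_rate': 0.01})  # assumed parameter name
Task.enqueue(clone, queue_name='hpo_queue')  # all agents pull from 'hpo_queue'
```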
Hmm. So say I have a parameter NUM_PARALLEL_EXECUTIONS, I can programmatically launch that many trains-agents for every optimization run?!
Correct 🙂
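Something along these lines should work (a sketch assuming the trains-agent CLI; the queue name and one-GPU-per-agent assignment are just examples):
```python
import subprocess

NUM_PARALLEL_EXECUTIONS = 4
QUEUE = 'hpo_queue'  # the same queue the HPO process enqueues into

# Spin up one trains-agent daemon per parallel execution, all pulling
# from the same queue. Here each agent is pinned to its own GPU; pass
# the same GPU index to several agents to share a single device.
agents = [
    subprocess.Popen(
        ['trains-agent', 'daemon', '--queue', QUEUE, '--gpus', str(i)]
    )
    for i in range(NUM_PARALLEL_EXECUTIONS)
]
# The daemons run until stopped, so this blocks for the whole run.
for agent in agents:
    agent.wait()
```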
You can spin it in two modes, either venv or docker. Notice that even in docker mode it will still clone the code into the docker and install the packages inside it, but since it inherits the preinstalled system packages from the docker, the installation process is a lot faster, and you can still change packages without having to build an entire new docker image.
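Docker mode is the same daemon with a --docker flag (the image name here is only an example):
```python
import subprocess

# Same agent, but each task runs inside the given docker image; the
# agent still clones the code and installs missing packages in the
# container, on top of whatever the image already provides.
subprocess.Popen(
    ['trains-agent', 'daemon', '--queue', 'hpo_queue',
     '--docker', 'nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04']
)
```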
(Do notice that even though you can spin two agents on the same GPU, the nvidia drivers cannot share allocated GPU memory, so if one Task consumes too much memory the other will not have enough free GPU memory to run)
Basically the same restriction as manually launching two processes using the same GPU
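So each Task has to leave room for its neighbor. For instance, if the Tasks use PyTorch (1.8+), each one can cap its own share of the device (just a sketch, this is not something trains-agent does for you):
```python
import torch

# Each Task voluntarily limits itself to half the device's memory, so
# two Tasks sharing one GPU cannot starve each other. Allocations past
# the cap raise a CUDA out-of-memory error in this process only.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)
```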
That makes sense. Currently, I use python multiprocessing to launch multiple experiments on the same GPU device. I'm guessing using trains-agent will be similar
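Something like this today (a toy version of the multiprocessing setup; train_one_experiment is a placeholder for the actual training entry point):
```python
from multiprocessing import Process

def train_one_experiment(learning_rate):
    # placeholder for the real training code; all the processes
    # share GPU 0 and compete for its memory
    ...

# Launch several experiments as separate processes on the same GPU;
# trains-agent would replace this with daemons pulling Tasks from a queue.
procs = [Process(target=train_one_experiment, args=(lr,))
         for lr in (0.1, 0.01, 0.001)]
for p in procs:
    p.start()
for p in procs:
    p.join()
```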
Got it. I haven't tried setting up trains-agent yet, so I don't know much about the overhead of launching the agent. I'd imagine if it has to create the full environment (installing requirements, etc.), the overhead might not be that low. But as I'm reading, it looks like I can use a docker image with the full env. Is my understanding correct?
Basically it solves the remote-execution problem, so you can scale to multiple machines relatively easily :)