WackyRabbit7 I guess we are discussing this one on a diff thread, but yes, it should totally work, that's the idea
The scenario I'm going for is never to run on the dev machine, so all I'll need to do once the server + agents are up is to add task.execute_remotely right after the Task.init line, and then when the script is executed on the dev machine it won't actually run but rather enqueue itself for the agent to run?
Continuing on this line of thought... Is it possible to call task.execute_remotely on a CPU-only machine (a data scientist's laptop, for example) and have the agent that fetches this task run it on a GPU? I'm asking because it is mentioned that it replicates the running environment of the task creator... which is exactly what I'm trying not to do
That is correct.
Obviously once it is in the system, you can just clone/edit/enqueue it.
Running it once is a means to populate the trains-server.
Make sense?
Is it possible to launch a task from Machine C into a queue that Machine B's agent is listening to?
Yes, that's the idea
Do I have to have anything installed (aside from the trains PIP package) on Machine C to do so?
Nothing, pure magic
(without having to execute it first on Machine C)
Someone somewhere has to create the definition of the environment...
The easiest way to go about it is to execute it once.
You can add the following line to your code: task.execute_remotely(queue_name='default')
This will cause your code to stop running and enqueue itself on a specific queue.
Quite useful if you want to make sure everything works (like running a single step), then continue on another machine.
Notice that switching between CPU and GPU packages is taken care of by the trains-agent, so you can run on a CPU machine for testing, then enqueue the same code and the trains-agent on the GPU machine will install the correct version of the packages (TF or PyTorch).
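For reference, a minimal sketch of that pattern (the project/task names below are just placeholders):

from trains import Task

# placeholder project/task names
task = Task.init(project_name='examples', task_name='remote run')

# stop the local run here and enqueue this task on the 'default' queue;
# an agent listening on that queue will recreate the environment and run the rest
task.execute_remotely(queue_name='default')

# everything below executes only on the agent's machine
...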
A few implementation / design details:
When you run code with Trains (and call Task.init) it will record your environment (python packages, git code, uncommitted changes etc.). Everything is stored on the Task object in the trains-server.
When you clone a task you literally create a copy of the Task object (i.e. a second experiment). On the cloned experiment you can edit everything (parameters, git, base docker image etc.).
When you enqueue a Task you add its ID to an execution queue.
A trains-agent listening on that queue pops the Task ID, sets up the environment as written on the task, then launches and monitors the code.
Multiple trains-agent workers can listen on the same queue, or on multiple queues in a priority fashion (i.e. first try to pop from the first queue, then the second, etc.).
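For completeness, a rough sketch of that clone/edit/enqueue flow done from Python rather than the web UI (the task ID, names and parameter values are placeholders, and I'm assuming the Task.clone / Task.enqueue helpers here; the UI offers the same actions):

from trains import Task

# grab the original Task that was created by running the script once
source = Task.get_task(task_id='<source-task-id>')

# cloning creates a second, fully editable copy of the Task object
cloned = Task.clone(source_task=source, name='cloned experiment')

# edit anything before running, e.g. hyperparameters (illustrative values)
cloned.set_parameters({'learning_rate': 0.001})

# add the clone's ID to the 'default' execution queue;
# any trains-agent listening on that queue will pick it up
Task.enqueue(cloned, queue_name='default')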