Hi SmilingFrog76
Great question, sadly multi-node is never simple 🙂
Let's start with the basics: assume one worker is available and the other is not. What would you want to happen? (P.S. I'm not aware of any flexible multi-node training framework, i.e. one that can detect that another node has become available and connect to it mid-training. That said, one might exist 🙂)
SmilingFrog76 this is not a weird mechanism at all, this is a proper HPC scheduler 🙂 trains-agent is not actually aware of other nodes; it is responsible for launching a Task on its own hardware (with whatever configuration it was set up with). What can be done is to use trains-agent inside a 3rd-party scheduler: the scheduler allocates the node and trains-agent spins up the experiment. There is a k8s example (linked below): basically it pulls jobs from the trains-server queue and pushes them into a k8s scheduler. The same can be done with SLURM.
What do you think?
https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py
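To make the "pull from the queue, push into a scheduler" idea concrete, here is a minimal sketch of one glue step. It is not the code from the linked example: `run_glue_once`, `pull_task_id` and `submit_job` are hypothetical names, and the two callables stand in for whatever queue client and scheduler (k8s, SLURM, ...) you actually use. Only the `trains-agent execute` command line comes from this thread.

```python
from collections import deque
from typing import Callable, Optional


def run_glue_once(
    pull_task_id: Callable[[], Optional[str]],
    submit_job: Callable[[str], None],
) -> Optional[str]:
    """Pull one task ID from a queue and hand a launch command to a scheduler.

    Both callables are hypothetical stand-ins (assumptions, not Trains APIs):
    pull_task_id returns the next task ID or None when the queue is empty,
    submit_job receives the shell command the scheduler should run.
    """
    task_id = pull_task_id()
    if task_id is None:
        return None  # nothing queued right now
    # The scheduler picks the node; trains-agent sets up and runs the task there.
    command = f"trains-agent execute --full-monitoring --id {task_id}"
    submit_job(command)
    return task_id


if __name__ == "__main__":
    # In-memory stand-ins so the sketch runs without a server or a cluster.
    pending = deque(["task-a", "task-b"])
    submitted = []
    pull = lambda: pending.popleft() if pending else None
    while run_glue_once(pull, submitted.append) is not None:
        pass
    print(submitted)
```

In a real setup the loop would sleep between polls and `submit_job` would call the scheduler's own submission mechanism.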
I see, thanks! So, correct me if I'm wrong: there is no way for the agent to deal with resource management, it's simply not its job (seems logical). Also, there is no internal scheduler in Trains; I can just create more queues and spin up N agents, each with their own resource set, even on different machines. If that's it, I only see these alternatives:
1. Forget about "abstracting resources" and simply use more queues and/or manually partition stuff among agents
2. Use Trains as experiment tracker, while configuring the two machines with SLURM for instance
3. Set up a k8s cluster and implement your proposed example
The last point is a bit fuzzy to me, mostly because I never tinkered with k8s before (and it seems quite overkill to me for a dual-node configuration), so I'm more prone to attempt #2 and fall back to the second one. Do you see other alternatives (i.e. maybe SLURM can work with agents)? Thanks again and sorry to bother!
SmilingFrog76
"there is no internal scheduler in Trains"
Actually, there is a scheduler built into Trains: the queues (order / priority).
What is missing from it is multi-node connection, e.g. I need two agents running the exact same job, working together
(as opposed to: I have two jobs, execute them separately when a resource is available).
Actually my suggestion was to add a SLURM integration, like we did with k8s. (I'm not suggesting Kubernetes as a solution for you, quite the opposite: k8s does not have the kind of scheduler you are looking for, only SLURM has it.)
There is also a middle-ground between 2 & 3.
Let's assume we cloned an experiment in the UI and configured it. This Task has a unique ID (press the ID button next to the name to see it).
We could schedule a SLURM job that basically runs the following command on two nodes:
trains-agent execute --full-monitoring --id <my_task_id_here>
This way we keep the ability to change arguments, have trains-agent set up the environment / code for us, and use SLURM to schedule the actual job.
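As a rough illustration of the SLURM approach described above, the sketch below renders a small batch script that asks for two nodes and starts the same `trains-agent execute` command once per node. The function name `render_sbatch_script` and the job name are made up for this example; the `#SBATCH` options and `srun` are standard SLURM, but partition, GPU and time settings are left out and would depend on your cluster.

```python
import shlex


def render_sbatch_script(task_id: str, nodes: int = 2, job_name: str = "trains-task") -> str:
    """Return the text of an sbatch script that runs one Trains task on `nodes` nodes.

    Hypothetical helper for illustration; only the trains-agent command line
    is taken from the discussion.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    agent_cmd = "trains-agent execute --full-monitoring --id " + shlex.quote(task_id)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        # srun starts one copy of the command per allocated task (one per node here).
        f"srun {agent_cmd}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the result to a file and submit it with: sbatch <file>
    print(render_sbatch_script("<my_task_id_here>"), end="")
```

Note that this only gets the same Task started on both nodes; how the two processes find each other and train together is still up to the training code itself.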
What do you think?
BTW: Option (3) is basically automating this exact procedure.
Hi AgitatedDove14, thanks for the quick reply! Well, not sure if this answers your question, but what I'd like to have is a scheduler that is aware of the resources that each agent has access to, in some way. E.g. I launch an agent on machine1, then an agent on machine2, but when I enqueue a task I'd like to say "this requires 2 GPUs" and some sort of scheduling mechanism assigns it to any available agent with those resources available. Of course, once I turn off machine1, agent 1 also dies, so the scheduler should handle new tasks accordingly. This is basically how I imagined it to work; I apologize if it's a weird mechanism and/or the Trains stack was not supposed to work this way.
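To pin down the behaviour being asked for, here is a toy sketch of resource-aware assignment. It is purely illustrative and not something Trains provides: `Agent` and `assign` are hypothetical names, and a real scheduler would also need to release GPUs when a task finishes and detect dead agents on its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    """Hypothetical record of what one agent can offer."""
    name: str
    free_gpus: int
    alive: bool = True


def assign(agents: List[Agent], required_gpus: int) -> Optional[Agent]:
    """Give the task to the first live agent with enough free GPUs, else None."""
    for agent in agents:
        if agent.alive and agent.free_gpus >= required_gpus:
            agent.free_gpus -= required_gpus  # reserve the GPUs for this task
            return agent
    return None  # no agent can take the task right now; it stays queued


if __name__ == "__main__":
    agents = [Agent("machine1", free_gpus=1), Agent("machine2", free_gpus=2)]
    chosen = assign(agents, required_gpus=2)
    print(chosen.name if chosen else "task stays queued")
```

This is roughly the matching that a scheduler such as SLURM performs, which is why it comes up as the suggested route.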