Hi AgitatedDove14, thanks for the quick reply! Well, not sure if this answers your question, but what I'd like to have is a scheduler that is aware of the resources each agent has access to, in some way. E.g. I launch an agent on machine1, then an agent on machine2, but when I enqueue a task I'd like to say "this requires 2 GPUs" and have some sort of scheduling mechanism assign it to any agent that has those resources available. Of course, once I turn off machine1, agent 1 also dies, so the scheduler should handle new tasks accordingly. This is basically how I imagined it to work; I apologize if it's a weird mechanism and/or the Trains stack was not supposed to work this way.
Hi SmilingFrog76
Great question, sadly multi-node is never simple 🙂
Let's start with the basics: let's assume one worker is available and the other is not, what would you want to happen? (P.S. I'm not aware of flexible multi-node training frameworks, i.e. a framework that can detect that another node is available and connect with it mid-training; that said, it might exist 🙂 )
SmilingFrog76
"there is no internal scheduler in Trains"
So obviously there is a scheduler built into Trains: the queues (order / priority).
What is missing from it is multi node connection, e.g. I need two agents running the exact same job working together.
(as opposed to, I have two jobs, execute them separately when a resource is available)
Actually my suggestion was to add a SLURM integration, like we did with k8s (I'm not suggesting Kubernetes as a solution for you, the opposite, k8s does not have the kind of scheduler you are looking for, only SLURM has it)
There is also a middle-ground between 2 & 3.
Let's assume we cloned an experiment in the UI and configured it; this Task has a unique ID (press the ID button next to the name to see it).
We could schedule a SLURM job that basically runs the following command on two nodes:
trains-agent execute --full-monitoring --id <my_task_id_here>
This way we get the benefit of being able to change arguments and have trains-agent set up the environment / code for us, while SLURM schedules the actual job.
What do you think?
BTW: option (3) is basically automating this exact procedure.
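To make the SLURM idea a bit more concrete, here is a rough sketch of generating such a batch script from a Task ID. The function name `build_sbatch_script`, the job name, and the GPU request are assumptions for illustration only; the `trains-agent execute` line is the command quoted above. It also assumes the training code itself handles the multi-node rendezvous, since trains-agent only launches the Task on each node.

```python
import shlex


def build_sbatch_script(task_id: str, nodes: int = 2, gpus_per_node: int = 1) -> str:
    """Return the text of a SLURM batch script (hypothetical helper).

    The script asks SLURM for `nodes` nodes, one task per node, and runs
    the same trains-agent command on each of them via srun.
    """
    if nodes < 1 or gpus_per_node < 0:
        raise ValueError("nodes must be >= 1 and gpus_per_node >= 0")
    safe_id = shlex.quote(task_id)  # guard against odd characters in the ID
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=trains-task",      # assumed job name
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",         # one agent process per node
        f"#SBATCH --gres=gpu:{gpus_per_node}", # assumed GPU request
        f"srun trains-agent execute --full-monitoring --id {safe_id}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder ID; copy the real one from the UI.
    print(build_sbatch_script("my_task_id_here"))
```

The printed text could be saved to a file and submitted with `sbatch`; partition, time limit, and similar settings would depend on the cluster.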
I see, thanks! So, correct me if I'm wrong: there is no way for the agent to deal with resource management, it's simply not its job (seems logical). Also, there is no internal scheduler in Trains, I can just create more queues and spin up N agents, each with their own resource set, even in different machines. If that's it, I only see these alternatives:
1. Forget about "abstracting resources" and simply use more queues and/or manually partition stuff among agents
2. Use Trains as an experiment tracker, while configuring the two machines with SLURM, for instance
3. Set up a k8s cluster and implement your proposed example

The last point is a bit fuzzy to me, mostly because I never tinkered with k8s before (and it seems quite overkill to me for a dual-node configuration), so I'm more inclined to attempt #2 and fall back to the second one. Do you see other alternatives (i.e. maybe SLURM can work with agents)? Thanks again and sorry to bother!
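For what it's worth, the first alternative (one queue per resource set) could look roughly like the sketch below. The queue names and the GPU layout are made up for illustration, and `pick_queue` is a hypothetical helper, not part of Trains; the only real pieces are the `--queue` and `--gpus` options of `trains-agent daemon`.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one queue per machine, with the GPU indices that the
# agent serving that queue is allowed to use.
QUEUE_GPUS: Dict[str, List[int]] = {
    "machine1_one_gpu": [0],
    "machine2_two_gpu": [0, 1],
}


def daemon_command(queue: str, gpus: List[int]) -> List[str]:
    """Command line to start one agent bound to a queue and a GPU set."""
    gpu_arg = ",".join(str(g) for g in gpus)
    return ["trains-agent", "daemon", "--queue", queue, "--gpus", gpu_arg]


def pick_queue(required_gpus: int,
               queue_gpus: Dict[str, List[int]] = QUEUE_GPUS) -> Optional[str]:
    """Pick the smallest queue whose agent has enough GPUs, or None."""
    candidates = [
        (len(gpus), name)
        for name, gpus in queue_gpus.items()
        if len(gpus) >= required_gpus
    ]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    for name, gpus in QUEUE_GPUS.items():
        print(" ".join(daemon_command(name, gpus)))
    print(pick_queue(2))
```

This is only the manual partitioning: nothing here notices when a machine goes down, which is exactly the part a real scheduler would add.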
SmilingFrog76 this is not a weird mechanism at all, this is a proper HPC scheduler 🙂
trains-agent is not actually aware of other nodes; it is responsible for launching a Task on its own hardware (with whatever configuration it was set up with). What can be done is to use trains-agent inside a 3rd party scheduler, and have the scheduler allocate the node and trains-agent spin up the experiment. There is a k8s example (link below) that basically pulls jobs from the trains-server queue and pushes them into a k8s scheduler. The same can be done with SLURM.
What do you think?
https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py
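The "pull from the queue, push into a scheduler" pattern can be sketched generically as below. This is not the code from the linked example: `pull_next_task_id` and `submit_job` are hypothetical callables standing in for "read the next Task ID from a Trains queue" and "hand a command to SLURM/k8s", and the demo uses an in-memory queue instead of a real server.

```python
from collections import deque
from typing import Callable, List, Optional


def glue_once(
    pull_next_task_id: Callable[[], Optional[str]],
    submit_job: Callable[[List[str]], None],
) -> Optional[str]:
    """Move one Task from the queue to the external scheduler.

    Returns the Task ID that was handed over, or None if the queue is empty.
    """
    task_id = pull_next_task_id()
    if task_id is None:
        return None
    # The scheduler decides where this runs; trains-agent sets up the
    # environment and code once the job starts on the allocated node.
    submit_job(["trains-agent", "execute", "--full-monitoring", "--id", task_id])
    return task_id


def drain(
    pull_next_task_id: Callable[[], Optional[str]],
    submit_job: Callable[[List[str]], None],
) -> int:
    """Hand over Tasks until the queue is empty; return how many were moved."""
    moved = 0
    while glue_once(pull_next_task_id, submit_job) is not None:
        moved += 1
    return moved


if __name__ == "__main__":
    # In-memory stand-ins for the Trains queue and the scheduler.
    fake_queue = deque(["task_a", "task_b"])
    submitted: List[List[str]] = []
    count = drain(
        lambda: fake_queue.popleft() if fake_queue else None,
        submitted.append,
    )
    print(count, submitted)
```

A real glue service would poll in a loop with a sleep between attempts, and `submit_job` would wrap whatever submission mechanism the scheduler offers.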