SmilingFrog76
there is no internal scheduler in Trains
So obviously there is a scheduler built into Trains: the queues (order / priority).
What is missing from it is multi-node support, i.e. two agents running the exact same job and working together
(as opposed to: I have two jobs, execute them separately when a resource is available).
Actually my suggestion was to add a SLURM integration, like we did with k8s (I'm not suggesting Kubernetes as a solution for you; quite the opposite: k8s does not have the kind of scheduler you are looking for, only SLURM has it).
There is also a middle ground between options 2 & 3.
Let's assume we cloned an experiment in the UI and configured it. This Task has a unique ID (press the ID button next to the name to see it).
We could schedule a SLURM job that basically runs the following command on two nodes:
trains-agent execute --full-monitoring --id <my_task_id_here>
This way we keep the ability to change arguments and have trains-agent set up the environment / code for us, while SLURM schedules the actual job.
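Just to make the idea concrete, here is a rough sketch of how that submission could be scripted. It only builds the `sbatch` command line; `build_sbatch_command` is a hypothetical helper name (not part of Trains), and it assumes `sbatch` / `srun` and `trains-agent` are available on the cluster nodes. Whether one task per node is the right layout for your job is also an assumption.

```python
import shlex


def build_sbatch_command(task_id, nodes=2):
    """Build an sbatch argv list (hypothetical helper, not part of Trains)."""
    # The agent command from the message above, with the Task ID filled in.
    agent_cmd = ["trains-agent", "execute", "--full-monitoring", "--id", task_id]
    # srun starts one copy of the agent command per allocated task,
    # so with one task per node we get one agent on each node.
    wrapped = "srun " + " ".join(shlex.quote(arg) for arg in agent_cmd)
    return [
        "sbatch",
        f"--nodes={nodes}",
        "--ntasks-per-node=1",
        f"--wrap={wrapped}",
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of submitting it.
    print(shlex.join(build_sbatch_command("<my_task_id_here>")))
```

Printing the command first (rather than calling `sbatch` directly) makes it easy to check what would be submitted before anything hits the cluster.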
What do you think?
BTW: Option (3) is basically automating this exact procedure.