Hi everyone, looking for ML management tools I stumbled upon Trains, I must say it has been awesome so far. I just have a (probably stupid) question: I'm trying to set up a multi-node training environment and I thought I could solve this with agents, but a


SmilingFrog76

there is no internal scheduler in Trains

Trains does have a built-in scheduler: the queues (order / priority).
What is missing is multi-node coordination, e.g. two agents running the exact same job and working together
(as opposed to two separate jobs, each executed when a resource becomes available).

Actually, my suggestion was to add a SLURM integration, like we did with k8s. (I'm not suggesting Kubernetes as a solution for you. Quite the opposite: k8s does not have the kind of scheduler you are looking for, only SLURM does.)

There is also a middle ground between options 2 and 3.
Let's assume we clone an experiment in the UI and configure it. This Task has a unique ID (press the ID button next to the name to see it).
We could schedule a SLURM job that basically runs the following command on two nodes:

trains-agent execute --full-monitoring --id <my_task_id_here>

This way we keep the ability to change arguments, trains-agent sets up the environment / code for us, and SLURM schedules the actual job.

What do you think?

BTW: Option (3) is basically automating this exact procedure.
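
To make the middle-ground idea a bit more concrete, here is a rough sketch (not an official integration) of a small Python helper that generates an sbatch script running the trains-agent command above once on each node. The helper name build_sbatch_script, the job name, and the example Task ID are made up for illustration; the #SBATCH options and srun are standard SLURM, and the trains-agent command is the one quoted above.

```python
import shlex


def build_sbatch_script(task_id: str, nodes: int = 2) -> str:
    """Return the text of an sbatch script that runs the same Task once per node.

    Hypothetical helper for illustration only; it is not part of Trains.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")

    # The command from the post; the Task ID is shell-quoted defensively.
    agent_cmd = (
        "trains-agent execute --full-monitoring --id " + shlex.quote(task_id)
    )

    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=trains-multinode",  # assumed name, pick your own
        f"#SBATCH --nodes={nodes}",             # number of nodes to allocate
        "#SBATCH --ntasks-per-node=1",          # one trains-agent process per node
        f"srun {agent_cmd}",                    # srun starts the command on every node
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "abc123" is a placeholder Task ID; copy the real one from the UI.
    print(build_sbatch_script("abc123", nodes=2))
```

The printed text could be saved to a file and submitted with sbatch. Note that this only covers launching the same Task on two nodes; how the training processes find each other (rank, master address, and so on) still has to be handled by the training code itself.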

  				
Posted 4 years ago

					AgitatedDove14
				