Hi everyone,
While looking for ML management tools I stumbled upon Trains, and I must say it has been awesome so far. I just have a (probably stupid) question: I'm trying to set up a multi-node training environment and I thought I could solve this with agents, but after a few trials I'm not sure about it.
I've got two "workstations" with 4 GPUs each. Is there a simple way to "abstract" away the machine layer and have the Trains agents allocate work wherever resources are free, or am I forced to start two individual agents, one per machine, and manually assign jobs to one of the two? Basically, is there any "automagical" scheduling tool based on available resources that I can use, or is it completely out of scope?
Again, sorry for the borderline-dumb question. I believe this could be solved with an HPC solution, by first merging the machines into a single cluster and then applying the agent or something similar, but my knowledge stops way before that.

TL;DR: In essence, I'm looking for a way to automatically schedule tasks between two worker machines, if possible. If it's feasible to achieve this within Trains, amazing, but I'm not excluding other alternatives.

Thanks in advance!

  				
Posted 3 years ago by SmilingFrog76


Answers 5

Hi AgitatedDove14, thanks for the quick reply! Well, not sure if this answers your question, but what I'd like to have is a scheduler that is somehow aware of the resources each agent has access to. E.g. I launch an agent on machine1, then an agent on machine2, but when I enqueue a task I'd like to say "this requires 2 GPUs" and have some sort of scheduling mechanism assign it to any available agent with those resources free. Of course, once I turn off machine1, agent 1 also dies, so the scheduler should handle new tasks accordingly. This is basically how I imagined it to work; I apologize if it's a weird mechanism and/or the Trains stack was not supposed to work this way.
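To make the idea concrete, here is a minimal sketch of the behaviour I have in mind. It is plain Python and not anything Trains provides: the `Agent` record and the `assign` function are hypothetical names made up for illustration, and the policy is a simple first-fit over free GPUs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    """Hypothetical record of one worker machine and its free GPUs."""
    name: str
    free_gpus: int
    online: bool = True


def assign(task_gpus: int, agents: List[Agent]) -> Optional[str]:
    """First-fit: return the first online agent with enough free GPUs."""
    for agent in agents:
        if agent.online and agent.free_gpus >= task_gpus:
            agent.free_gpus -= task_gpus  # reserve the GPUs for this task
            return agent.name
    return None  # nothing fits right now, the task would stay queued


agents = [Agent("machine1", free_gpus=4), Agent("machine2", free_gpus=4)]
first = assign(2, agents)    # fits on machine1
second = assign(4, agents)   # machine1 has only 2 GPUs left, so machine2
agents[0].online = False     # machine1 is switched off
third = assign(2, agents)    # machine1 is offline and machine2 is full
```

A real scheduler would also have to release GPUs when a task finishes and retry queued tasks; the sketch leaves that out.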

  				
Posted 3 years ago by SmilingFrog76

I see, thanks! So, correct me if I'm wrong: there is no way for the agent to deal with resource management, it's simply not its job (seems logical). Also, there is no internal scheduler in Trains; I can just create more queues and spin up N agents, each with its own resource set, even on different machines. If that's it, I only see these alternatives:
1. Forget about "abstracting resources" and simply use more queues and/or manually partition work among agents.
2. Use Trains as an experiment tracker, while configuring the two machines with SLURM, for instance.
3. Set up a k8s cluster and implement your proposed example.
The last point is a bit fuzzy to me, mostly because I never tinkered with k8s before (and it seems quite overkill to me for a dual-node configuration), so I'm more prone to attempt #2 and fall back to the second one. Do you see other alternatives (e.g. maybe SLURM can work with agents)? Thanks again and sorry to bother!

  				
Posted 3 years ago by SmilingFrog76

Hi SmilingFrog76
Great question. Sadly, multi-node is never simple 🙂
Let's start with the basics: assume one worker is available and the other is not, what would you want to happen? (P.S. I'm not aware of flexible multi-node training frameworks, i.e. a framework that can detect that another node is available and connect to it mid-training. That said, one might exist 🙂)

  				
Posted 3 years ago by AgitatedDove14

SmilingFrog76 this is not a weird mechanism at all, this is a proper HPC scheduler 🙂
trains-agent is not actually aware of other nodes; it is responsible for launching a Task on its own hardware (with whatever configuration it was set up with). What can be done is to use trains-agent inside a 3rd-party scheduler, and have the scheduler allocate the node while trains-agent spins up the experiment. There is a k8s example here, which basically pulls jobs from the trains-server queue and pushes them into a k8s scheduler:
https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py
The same can be done with SLURM.

What do you think?
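The "pull from the queue, push into a scheduler" pattern can be sketched roughly as below. This is not the code from the linked example: `fetch_next_task_id` and `submit_job` are hypothetical callables standing in for "ask the server for the next queued Task" and "hand it to the external scheduler", and the demo uses in-memory stand-ins so it runs without a server or a cluster.

```python
from collections import deque
from typing import Callable, List, Optional


def forward_tasks(
    fetch_next_task_id: Callable[[], Optional[str]],
    submit_job: Callable[[str], None],
) -> int:
    """Drain the queue, handing every pending task ID to the scheduler."""
    forwarded = 0
    while True:
        task_id = fetch_next_task_id()
        if task_id is None:  # the queue is empty
            return forwarded
        submit_job(task_id)
        forwarded += 1


# In-memory stand-ins for the queue and the external scheduler.
pending = deque(["task-a", "task-b"])
submitted: List[str] = []
count = forward_tasks(
    fetch_next_task_id=lambda: pending.popleft() if pending else None,
    submit_job=submitted.append,
)
```

A long-running glue service would wrap this in a polling loop with a sleep between passes, rather than returning once the queue is empty.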

  				
Posted 3 years ago by AgitatedDove14

SmilingFrog76

"there is no internal scheduler in Trains"

Well, there is a scheduler built into Trains: the queues (order / priority).
What is missing from it is multi-node connection, e.g. "I need two agents running the exact same job, working together"
(as opposed to "I have two jobs, execute them separately when a resource is available").

Actually, my suggestion was to add a SLURM integration, like we did with k8s. (I'm not suggesting Kubernetes as a solution for you; on the contrary, k8s does not have the kind of scheduler you are looking for, only SLURM has it.)

There is also a middle ground between options 2 and 3.
Let's assume we cloned an experiment in the UI and configured it. This Task has a unique ID (press the ID button next to the name to see it).
We could schedule a SLURM job that basically runs the following command on two nodes:

trains-agent execute --full-monitoring --id <my_task_id_here>

This way we keep the ability to change arguments and have trains-agent set up the environment / code for us, while SLURM schedules the actual job.

What do you think?

BTW: option (3) is basically automating this exact procedure.
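As a rough sketch of that middle ground, the helper below generates a SLURM batch script that wraps the command above. The function name `build_sbatch_script` and its defaults are made up for illustration, and it assumes a SLURM cluster where GPUs are exposed as generic resources (`--gres=gpu:N`); only the `trains-agent execute` line comes from this thread.

```python
def build_sbatch_script(task_id: str, nodes: int = 2, gpus_per_node: int = 4) -> str:
    """Return a SLURM batch script that runs one trains-agent per node."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        # srun launches the command once per task, i.e. once on each node
        f"srun trains-agent execute --full-monitoring --id {task_id}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder ID: replace with the Task ID copied from the UI.
script = build_sbatch_script("<my_task_id_here>")
```

The resulting text can be saved to a file and submitted with `sbatch`. Note that this only launches the same Task on both nodes; making the two processes cooperate is still up to the training code itself.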

  				
Posted 3 years ago by AgitatedDove14
