Hi everyone, looking for ML management tools I stumbled upon Trains, I must say it has been awesome so far. I just have a (probably stupid) question: I'm trying to set up a multi-node training environment and I thought I could solve this with agents, but a

Unanswered

I see, thanks! So, correct me if I'm wrong: the agent does not handle resource management; that's simply not its job (which seems logical). Also, there is no internal scheduler in Trains, so I can just create more queues and spin up N agents, each with its own resource set, even on different machines. If that's right, I only see these alternatives:

1. Forget about "abstracting resources" and simply use more queues and/or manually partition resources among agents.
2. Use Trains as the experiment tracker, while configuring the two machines with SLURM, for instance.
3. Set up a k8s cluster and implement your proposed example.

The last point is a bit fuzzy to me, mostly because I have never tinkered with k8s before (and it seems like overkill for a dual-node configuration), so I'm more inclined to attempt #2 and fall back to the second one. Do you see other alternatives (e.g. maybe SLURM can work with agents)? Thanks again and sorry to bother!
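For reference, here is a rough sketch of what the first alternative (one queue per resource set, one agent per queue) might look like. The host names, queue names, and GPU layout are made up for illustration, and I'm assuming the `trains-agent daemon` command accepts `--queue` and `--gpus` as I understand it does; the script only prints the commands, it does not start any agent.

```python
import shlex

# Hypothetical layout: two machines, each split into GPU sets,
# with one dedicated queue per GPU set. None of these names come from Trains.
PARTITIONS = [
    {"host": "node-a", "queue": "node_a_gpu01", "gpus": [0, 1]},
    {"host": "node-a", "queue": "node_a_gpu23", "gpus": [2, 3]},
    {"host": "node-b", "queue": "node_b_gpu0", "gpus": [0]},
]


def agent_command(queue, gpus):
    """Build the argv for one agent bound to a single queue and GPU set.

    Assumes `trains-agent daemon --queue <name> --gpus <ids>`.
    """
    gpu_arg = ",".join(str(g) for g in gpus)
    return ["trains-agent", "daemon", "--queue", queue, "--gpus", gpu_arg]


def commands_by_host(partitions):
    """Group the shell-quoted agent commands by the machine they run on."""
    grouped = {}
    for part in partitions:
        cmd = shlex.join(agent_command(part["queue"], part["gpus"]))
        grouped.setdefault(part["host"], []).append(cmd)
    return grouped


if __name__ == "__main__":
    for host, cmds in commands_by_host(PARTITIONS).items():
        print(f"# run on {host}")
        for cmd in cmds:
            print(cmd)
```

With this layout, choosing a queue when enqueuing an experiment is what selects the machine and GPUs, so the partitioning stays manual, as described in the post.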

  				
Posted 4 years ago by SmilingFrog76