Hi Everyone, Looking For Ml Management Tools I Stumbled Upon Trains, I Must Say It Has Been Awesome So Far. I Just Have A (Probably Stupid) Question: I'M Trying To Setup A Multi-Node Training Environment And I Thought I Could Solve This With Agents, But A

Unanswered

SmilingFrog76 this is not a weird mechanism at all , this is proper HPC scheduler 🙂
trains-agent is not actually aware of other nodes, it is responsible for launching a Task on its own hardware (with whatever configuration it was set). What can be done is to use the trains-agent inside a 3rd party scheduler and have the scheduler allocate the node and trains-agent spin the experiment. There is a k8s example here: basically pulling jobs for the trains-server queue and pushing them into a k8s scheduler. The same can be done with slurm.

What do you think?
https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py

  				
Posted 
	4 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

183 Views

0 Answers

4 years ago

2 years ago