Unanswered
Hi Everyone,
Looking For Ml Management Tools I Stumbled Upon Trains, I Must Say It Has Been Awesome So Far. I Just Have A (Probably Stupid) Question: I'M Trying To Setup A Multi-Node Training Environment And I Thought I Could Solve This With Agents, But A
SmilingFrog76 this is not a weird mechanism at all , this is proper HPC scheduler 🙂trains-agent
is not actually aware of other nodes, it is responsible for launching a Task on its own hardware (with whatever configuration it was set). What can be done is to use the trains-agent
inside a 3rd party scheduler and have the scheduler allocate the node and trains-agent spin the experiment. There is a k8s example here: basically pulling jobs for the trains-server queue and pushing them into a k8s scheduler. The same can be done with slurm.
What do you think?
https://github.com/allegroai/trains-agent/blob/master/examples/k8s_glue_example.py
149 Views
0
Answers
4 years ago
2 years ago