It manages the scheduling process, so no need to package your code, or worry about building dockers etc. It also has an AWS autoscaler, that spins ec2 instances based on the amount of jobs you have in the execution queue, and the limit of your budget (obviously spinning down machines that are idle)
The cool thing of using the trains-agent, you can change any experiment parameters and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines
I an running trains-server on AWS with your AMI (instance type t3.large)
The server runs very good, and works amazing!
Until we start to run more training in parallel (around 20).
Then, the UI start to be very slow and getting timeouts often.
Does upgrading the instance type can help here? or there is some limit to parallel running?
OHH nice, I thought that it just some kind of job queue on up and running machines
It's much more than that, it's a way of life 🙂
But seriously now, it allows you to use any machine as part of your cluster, and send jobs for execution from the web UI (any machine, even just a standalong GPU machine under your desk, or any cloud GPU instance any mixing the two together:)
Maybe I need to change something here:
Not sure, I'm still waiting on answer... It might not be exposed to the configuration file. Give me an hour or two
CooperativeFox72 yes 20 experiments in parallel means that you always have at least 20 connection coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel are the requests being delayed.
You can configure the API server to have more process workers, you just need to make sure the machine has enough memory to support it.