I haven't tried trains-agent yet. Does it support AWS Batch?
For now we are using AWS Batch to run those experiments,
because this way we don't have to keep machines running idle while they wait for jobs.
Ohh nice, I thought it was just some kind of job queue on machines that are already up and running.
It's much more than that, it's a way of life :)
But seriously now, it allows you to use any machine as part of your cluster and send jobs for execution from the web UI (any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, or a mix of the two :)
Not sure, I'm still waiting on an answer... It might not be exposed in the configuration file. Give me an hour or two.
AgitatedDove14 Maybe I need to change something here: apiserver.conf
to increase the number of workers?
It manages the scheduling process, so there is no need to package your code or worry about building Docker images etc. It also has an AWS autoscaler that spins up EC2 instances based on the number of jobs you have in the execution queue and the limit of your budget (and, obviously, spins down machines that are idle).
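To make the autoscaling idea above concrete, here is a minimal sketch of the kind of decision such a loop makes. It is not the actual trains autoscaler code; the function name, the `jobs_per_instance` parameter, and the `max_instances` budget cap are all hypothetical names chosen for illustration.

```python
import math


def instances_to_change(queued_jobs, running_instances, idle_instances,
                        jobs_per_instance=1, max_instances=4):
    """Return how many instances to start (positive) or stop (negative).

    running_instances counts every instance that is up, idle ones included.
    max_instances stands in for the budget limit (hypothetical parameter).
    """
    # Instances required to drain the current queue.
    needed = math.ceil(queued_jobs / jobs_per_instance)
    if needed > idle_instances:
        # Scale up, but never past the budget cap.
        headroom = max(max_instances - running_instances, 0)
        return min(needed - idle_instances, headroom)
    # The queue is already covered: spin down the idle machines left over.
    return -(idle_instances - needed)
```

For example, with 5 queued jobs, 2 busy instances, and a cap of 4, the sketch asks for 2 more instances; with an empty queue and 3 idle instances, it asks to stop all 3.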
Let me check... I think you might need to docker exec
Anyhow, I would start by upgrading the server itself.
Sounds good?
Thanks, I will upgrade the server for now and let you know.
CooperativeFox72 btw, are you guys running those 20 experiments manually or through trains-agent?
Hi CooperativeFox72,
From the backend guys, long story short: upgrade your machine => more CPU cores, more processes. It is that easy :)
Thanks, I will upgrade my instance type and then add more workers. Where do I need to configure it?
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of that. I'm assuming the sluggishness you feel is caused by requests being delayed.
You can configure the API server to have more worker processes, you just need to make sure the machine has enough memory to support them.
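As a rough way to reason about "enough memory", here is a small sizing sketch. The per-worker memory and the amount reserved for the other server components are made-up placeholder numbers, not measured trains-server figures, and `max_api_workers` is a hypothetical helper, so measure your own server before relying on it.

```python
def max_api_workers(cpu_cores, total_memory_gb,
                    memory_per_worker_gb=0.5, reserved_memory_gb=4.0):
    """Estimate how many API worker processes a machine can host.

    memory_per_worker_gb and reserved_memory_gb are assumptions for
    illustration only (the reserve stands in for the databases and the
    OS running on the same machine).
    """
    # Memory left over for API workers after the reserve is set aside.
    usable_gb = max(total_memory_gb - reserved_memory_gb, 0.0)
    by_memory = int(usable_gb // memory_per_worker_gb)
    # One worker per core at most, and always at least one worker.
    return max(1, min(cpu_cores, by_memory))
```

With these placeholder numbers, a t3.large (2 vCPUs, 8 GiB) comes out at 2 workers, limited by its cores, which matches the advice that more CPU cores allow more processes.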
I am running trains-server on AWS with your AMI (instance type t3.large).
The server runs very well and works amazingly!
That is, until we start running more trainings in parallel (around 20).
Then the UI starts to be very slow and we often get timeouts.
Can upgrading the instance type help here? Or is there some limit on parallel runs?
The cool thing about using trains-agent is that you can change any experiment's parameters and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines:
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
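A minimal sketch of that idea, assuming an existing template experiment and an agent listening on a queue: clone the template once per parameter combination, override the parameters, and enqueue each clone. It follows the clone-and-enqueue pattern of the task piping example linked above; the helper names, the parameter keys, and the `"default"` queue name are assumptions for illustration.

```python
import itertools


def parameter_grid(search_space):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        yield dict(zip(names, values))


def enqueue_variants(template_task_id, search_space, queue_name="default"):
    """Clone the template task per combination and send each clone to a queue."""
    # Imported lazily so parameter_grid works without trains installed.
    from trains import Task

    template = Task.get_task(task_id=template_task_id)
    for index, overrides in enumerate(parameter_grid(search_space)):
        cloned = Task.clone(source_task=template,
                            name="{} variant {}".format(template.name, index))
        # Start from the clone's current parameters and apply the overrides.
        params = cloned.get_parameters()
        params.update(overrides)
        cloned.set_parameters(params)
        # A trains-agent serving this queue picks the clone up and runs it.
        Task.enqueue(cloned.id, queue_name=queue_name)
```

Calling `enqueue_variants("<template task id>", {"learning_rate": [0.1, 0.01], "batch_size": [32, 64]})` would queue four experiments; the keys must match the parameter names your template experiment actually logs.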
CooperativeFox72 of course, anything trains related, this is the place :)
Fire away
Thanks!! You are the best.
I will give it a try when the runs finish.