Hi, I Am New Here, Can I Ask Questions About Trains-Server Too?

Answered

Hi, I am new here. Can I also ask questions about trains-server?

  				
Posted 4 years ago by CooperativeFox72


Answers 17

Thanks!! You are the best.
I will give it a try when the runs finish.

  				
Posted 4 years ago by CooperativeFox72

Hi CooperativeFox72,
From the backend guys, long story short: upgrade your machine => more CPU cores, more processes. It is that easy 🙂

  				
Posted 4 years ago by AgitatedDove14

OHH nice, I thought that it was just some kind of job queue on up-and-running machines

It's much more than that, it's a way of life 🙂
But seriously now, it allows you to use any machine as part of your cluster and send jobs for execution from the web UI (any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, and mixing the two together :)

Maybe I need to change something here:

apiserver.conf

Not sure, I'm still waiting on an answer... It might not be exposed in the configuration file. Give me an hour or two.

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 Maybe I need to change something here: apiserver.conf
to increase the number of workers?

  				
Posted 4 years ago by CooperativeFox72

I will go over the examples

  				
Posted 4 years ago by CooperativeFox72

OHH nice, I thought that it was just some kind of job queue on up-and-running machines

  				
Posted 4 years ago by CooperativeFox72

The cool thing about using trains-agent is that you can change any experiment parameter and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines:
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
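
As a rough illustration of the idea (an editorial sketch, not code taken from the linked examples): a parameter sweep boils down to enumerating parameter combinations and launching one copy of a template experiment per combination. The `launch` callback and the parameter names below are hypothetical. In a real setup the callback would clone the template experiment, override its parameters, and enqueue it for trains-agent, which is what the linked task_piping_example.py demonstrates.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List


def expand_grid(grid: Dict[str, Iterable]) -> List[dict]:
    """Return every combination of the given parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def sweep(grid: Dict[str, Iterable], launch: Callable[[dict], None]) -> int:
    """Call `launch` once per parameter combination; return how many were launched."""
    combos = expand_grid(grid)
    for params in combos:
        launch(params)
    return len(combos)


if __name__ == "__main__":
    # Hypothetical parameter names; `print` stands in for "clone and enqueue".
    grid = {"learning_rate": [0.1, 0.01], "batch_size": [32, 64]}
    sweep(grid, launch=print)
```

Keeping the launch step behind a callback means the same loop can be dry-run locally before any experiment is actually sent to a queue.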

  				
Posted 4 years ago by AgitatedDove14

It manages the scheduling process, so there is no need to package your code or worry about building Docker images, etc. It also has an AWS autoscaler that spins up EC2 instances based on the number of jobs you have in the execution queue and the limit of your budget (and, obviously, spins down machines that are idle).
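
To make the scaling rule concrete, here is a minimal sketch of the decision described above: enough instances for the queue, capped by a budget limit, and zero when nothing is queued. This is an editorial illustration only, not the trains autoscaler's implementation; every function and parameter name is hypothetical.

```python
import math


def instances_to_run(queued_jobs: int, jobs_per_instance: int, max_instances: int) -> int:
    """Target instance count: enough for the queue, capped by the budget limit."""
    if queued_jobs <= 0:
        return 0  # nothing queued, so idle machines can be spun down
    needed = math.ceil(queued_jobs / jobs_per_instance)
    return min(needed, max_instances)


def scaling_action(running: int, target: int) -> str:
    """Describe the step that moves the cluster from `running` to `target` instances."""
    if target > running:
        return f"spin up {target - running}"
    if target < running:
        return f"spin down {running - target}"
    return "no change"
```

For example, with 20 queued jobs, one job per instance, and a budget cap of 8 instances, the target is 8 regardless of queue length.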

  				
Posted 4 years ago by AgitatedDove14

I haven't tried trains-agent yet. Does it support using AWS Batch?

  				
Posted 4 years ago by CooperativeFox72

For now we are using AWS Batch to run those experiments,
because this way we don't have to keep machines around waiting for jobs.

  				
Posted 4 years ago by CooperativeFox72

Thanks, I will upgrade the server for now and will let you know.

  				
Posted 4 years ago by CooperativeFox72

CooperativeFox72 btw, are you guys running those 20 experiments manually or through trains-agent?

  				
Posted 4 years ago by AgitatedDove14

Let me check... I think you might need to docker exec
Anyhow, I would start by upgrading the server itself.
Sounds good?

  				
Posted 4 years ago by AgitatedDove14

Thanks, I will upgrade my instance type and then add more workers. Where do I need to configure it?

  				
Posted 4 years ago by CooperativeFox72

CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of that. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more worker processes; you just need to make sure the machine has enough memory to support them.
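
A back-of-the-envelope way to see why more worker processes help (an editorial sketch under simplifying assumptions, not a model of the actual API server): if each worker handles one request at a time and every request takes the same time, a burst of requests is served in rounds. The function name and the numbers in the usage note are illustrative only.

```python
import math


def worst_case_wait(concurrent_requests: int, workers: int, seconds_per_request: float) -> float:
    """Time until the last request in a burst is answered.

    Assumes each worker handles one request at a time and all requests
    take the same amount of time (a deliberate simplification).
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    rounds = math.ceil(concurrent_requests / workers)
    return rounds * seconds_per_request
```

Under these assumptions, 20 simultaneous requests at 0.5 s each take 2.5 s to clear with 4 workers but only 0.5 s with 20 workers, which is the kind of delay the UI would then queue behind.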

  				
Posted 4 years ago by AgitatedDove14

I am running trains-server on AWS with your AMI (instance type t3.large).

The server runs very well and works amazingly!
That is, until we start running more trainings in parallel (around 20).
Then the UI becomes very slow and often times out.
Would upgrading the instance type help here? Or is there some limit on parallel runs?

  				
Posted 4 years ago by CooperativeFox72

CooperativeFox72 of course, anything trains related, this is the place 🙂
Fire away

  				
Posted 4 years ago by AgitatedDove14
