btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
Can you share all the error info that you get in the network tab?
some of the "tasks.get_all_ex" POST requests fail, as far as I can see
Do you see any error in the browser network tab?
Great! What error do you still see in the UI when comparing more than 20 experiments? At the time of the error, do you see any error response from the apiserver (in the browser network tab)? When the call to compare 20+ task metrics succeeds, how much time does it usually take in your environment?
nope, the only changes we made to the config are adding web auth and the non-responsive tasks watchdog
just in case: this warning disappeared after I followed https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full
Hi DilapidatedDucks58, I am trying to reproduce the "Connection pool is full" warning. Do you override any apiserver environment variables in docker-compose? If so, can you share your version of the docker-compose file? Do you provide a configuration file for gunicorn? If so, can you please share it?
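In case it's unclear what is meant by a gunicorn configuration file: it is just a Python module that gunicorn loads at startup. A hypothetical minimal example (the values are placeholders, not the apiserver defaults):

```python
# gunicorn.conf.py -- hypothetical example of a gunicorn configuration file;
# all values below are placeholders, not the trains-apiserver defaults.
bind = "0.0.0.0:8008"   # address and port the server listens on
workers = 4             # number of worker processes
threads = 2             # threads per worker process
timeout = 120           # seconds before an unresponsive worker is restarted
loglevel = "warning"    # gunicorn log verbosity
```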
I'll think about it and get back to you - it might be interesting to understand what causes this when comparing >20 experiments...
running docker network prune
before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
I'll try to check it out and get back to you, seems very strange
Can you please check the max connections setting in your OS?
Yeah, connections keep getting dropped since the pool is full
This shouldn't be an issue, since the server should reuse connections, but perhaps the max connections limit on your server OS is relatively low?
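If it helps, assuming "max connections" here refers to the open-file-descriptor limit (each TCP connection uses one descriptor), a quick way to check it from inside the apiserver container is:

```python
import resource

# Soft/hard per-process limits on open file descriptors. A low soft limit
# caps how many concurrent connections the process can keep open.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```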
we do log a lot of the different metrics, maybe this can be part of the problem
Is it possible to get a longer log file for the apiserver? From what I see, there's some kind of a connection pool issue
OK, on first glance ES doesn't seem to have any issue
btw, are there any examples of exporting metrics using the Python client? I could only find the last_metrics attribute of the task
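In case a starting point helps, a sketch of reading metrics back with the trains SDK (the task ID is a placeholder, and get_reported_scalars may not be available in older SDK versions):

```python
from trains import Task

# Fetch an existing task by its ID (placeholder value).
task = Task.get_task(task_id="<task-id>")

# Summary values per metric/variant -- same data as the last_metrics attribute.
print(task.get_last_scalar_metrics())

# Full scalar history as reported to the server, roughly
# {metric: {variant: {"x": [...], "y": [...]}}} -- if the installed
# SDK version exposes this method.
scalars = task.get_reported_scalars()
print(list(scalars.keys()))
```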
Also, what's the Trains Server version?
Well, the server seems OK; disk size might be a little on the low end (just to be safe)
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
As always, the server log (trains-apiserver, for a start) and more details from the browser's developer tools Network section would be appreciated 🙂
What is your server machine profile/spec?