some of the POST requests "tasks.get_all_ex" fail as far as I can see
Can you share all the error info that you get in the network tab?
Well, the server seems OK; the disk size might be a little on the low end (you may want more, just to be safe)
btw, there are "[2020-09-02 15:15:40,331] [9] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch" in the apiserver logs again
nope, the only changes to config that we made are adding web-auth and non-responsive tasks watchdog
just in case: this warning disappeared after I followed https://stackoverflow.com/questions/49638699/docker-compose-restart-connection-pool-full
running "docker network prune" before starting the containers kind of helped. I still see an error when I'm comparing > 20 experiments, but at least trains works okay after that, and there are no connection pool limit errors in the logs
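In case it's useful, a sketch of that restart sequence (assuming the standard docker-compose deployment; adjust commands and flags to your own setup):

```shell
# Stop the trains-server containers first
docker-compose down

# Remove unused docker networks left over from previous runs;
# stale networks can keep dangling state around between restarts
docker network prune -f

# Bring the stack back up in the background
docker-compose up -d
```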
What is your server machine profile/spec?
This shouldn't be an issue, since the server should reuse connections, but perhaps the max connections limit on your server OS is relatively low?
Yeah, connections keep getting dropped since the pool is full
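To illustrate the mechanism behind that warning (a sketch only: _get_conn/_put_conn are urllib3 internals used here just for demonstration, and the host name is arbitrary since no request is actually sent), it fires whenever a connection is returned to a pool whose idle slots are already full:

```python
import logging
import urllib3

# Capture warnings from urllib3's pool, the same logger seen in the apiserver log
records = []
handler = logging.Handler()
handler.emit = lambda record: records.append(record.getMessage())
pool_logger = logging.getLogger("urllib3.connectionpool")
pool_logger.addHandler(handler)
pool_logger.setLevel(logging.WARNING)

# A pool with room for a single idle connection; block=False mirrors the default,
# so extra connections are created on demand instead of waiting for a free slot
pool = urllib3.HTTPConnectionPool("example.com", maxsize=1, block=False)

c1 = pool._get_conn()  # takes the one idle slot
c2 = pool._get_conn()  # a second connection while the first is checked out
pool._put_conn(c1)     # fills the single idle slot
pool._put_conn(c2)     # pool full -> "Connection pool is full, discarding connection"
```

With the default block=False the request still succeeds; the extra connection is simply closed instead of being kept for reuse, which is why the warning by itself is not fatal but does mean connections are being churned.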
Do you see any error in the browser network tab?
btw, are there any examples of exporting metrics using Python client? I could only find last_metrics attribute of the task
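For what it's worth, here is a minimal sketch of digging values out of last_metrics. The nested shape (metric hash, then variant hash, then a summary dict) is an assumption about what the attribute holds, and the hash keys in the sample are made up; check it against your own tasks:

```python
# Flatten a task's last_metrics structure into (metric, variant, value) rows.
# The sample dict below mimics the assumed server response shape.
def flatten_last_metrics(last_metrics):
    rows = []
    for metric_entries in last_metrics.values():
        for entry in metric_entries.values():
            rows.append((entry.get("metric"), entry.get("variant"), entry.get("value")))
    return rows

# Hypothetical sample data; the hash-like keys are fabricated for illustration
sample = {
    "6d5a1f": {
        "e3b0c4": {"metric": "loss", "variant": "train", "value": 0.12,
                   "min_value": 0.10, "max_value": 0.90},
        "a1f4b2": {"metric": "loss", "variant": "val", "value": 0.20,
                   "min_value": 0.15, "max_value": 1.10},
    },
}

for metric, variant, value in flatten_last_metrics(sample):
    print(f"{metric}/{variant}: {value}")
```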
I'll think about it and get back to you - it might be interesting to understand what causes this when comparing >20 experiments...
Also, what's the Trains Server version?
Hi DilapidatedDucks58 , I am trying to reproduce the "Connection pool is full" warning. Do you override any apiserver environment variables in docker-compose? If so, can you share your version of docker-compose? Do you provide a configuration file for gunicorn? If so, can you please share it?
Can you please check the max connections setting in your OS?
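On a Linux host, a few of the relevant limits can be checked like this (which one actually matters depends on your distro and docker setup):

```shell
# Per-process open file descriptor limit (each TCP connection consumes one)
ulimit -n

# System-wide open file limit
cat /proc/sys/fs/file-max

# Listen backlog ceiling for accepting new TCP connections
cat /proc/sys/net/core/somaxconn

# Ephemeral port range available to outgoing connections
cat /proc/sys/net/ipv4/ip_local_port_range
```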
we do log a lot of different metrics; maybe that's part of the problem
OK, on first glance ES doesn't seem to have any issue
m5.xlarge EC2 instance (4 vCPUs, 16 GB RAM), 100GB disk
I'll try to check it out and get back to you, seems very strange
Great! What error do you still see in the UI when comparing more than 20 experiments? At the time of the error, do you see any error response from the apiserver (in the browser network tab)? When the call to compare 20+ task metrics succeeds, how long does it usually take in your environment?
As always, the server log (trains-apiserver, for a start) and more details from the browser's developer tools Network section would be appreciated 🙂
Is it possible to get a longer log file for the apiserver? From what I see, there's some kind of a connection pool issue
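Assuming the container is named trains-apiserver (the actual name in your compose setup may differ), something like this grabs a longer log:

```shell
# Dump the last 5000 log lines from the apiserver container into a file
# (docker logs writes to both stdout and stderr, hence the 2>&1 redirect)
docker logs --tail 5000 trains-apiserver > apiserver.log 2>&1
```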