Hello Periodically Under High Load, We Are Facing Too Long(>1 Sec) Processing Times For Requests Such As: Workers.Status_Report Events.Add_Batch Queues.Get_Next_Task. Also There Are Warnings "Connection Pool Is Full, Discarding Connection: Elasticsearch-S

Answered

Hello,
Periodically, under high load, we see excessively long (>1 s) processing times for requests such as workers.status_report, events.add_batch, and queues.get_next_task.
There are also warnings: "Connection pool is full, discarding connection: elasticsearch-service".
Can you confirm that this is an Elasticsearch performance issue?
You have probably faced such issues before and can recommend something.

  				
Posted 3 years ago by ItchyJellyfish73


Answers 10

Hmm, are you getting the warning on the client side or in the clearml-server?

  				
Posted 3 years ago by AgitatedDove14

🙂

  				
Posted 3 years ago by SuccessfulKoala55

~30 rps

  				
Posted 3 years ago by ItchyJellyfish73

It seems the apiserver is out of connections, which is odd...
SuccessfulKoala55, do you have an idea?

  				
Posted 3 years ago by AgitatedDove14

As I discovered, this was ES overload caused by incorrect ClearML usage: report_scalar was called 100 times per second (a developer reported every sample from a WAV file). This didn't affect the apiserver, because events were batched. There should probably be some protection against overload at the clearml package or apiserver level, as developers can do any crazy stuff in their code 🙃
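Until such protection exists, one client-side workaround is to aggregate samples before reporting them. The following is only a minimal sketch: `AveragingReporter`, its `window` parameter, and the "audio"/"amplitude" names are hypothetical and not part of ClearML. It forwards one averaged value per window to whatever reporting callable it is given, for example a thin wrapper around `Logger.report_scalar`.

```python
from typing import Callable, List


class AveragingReporter:
    """Hypothetical helper: forwards one averaged value per `window` samples."""

    def __init__(self, report: Callable[[str, str, float, int], None],
                 title: str, series: str, window: int = 100) -> None:
        self._report = report      # callable(title, series, value, iteration)
        self._title = title
        self._series = series
        self._window = window
        self._buffer: List[float] = []
        self._iteration = 0

    def add(self, value: float) -> None:
        """Buffer a sample; report the mean once the window is full."""
        self._buffer.append(value)
        if len(self._buffer) >= self._window:
            self.flush()

    def flush(self) -> None:
        """Report the mean of any buffered samples and reset the buffer."""
        if not self._buffer:
            return
        mean = sum(self._buffer) / len(self._buffer)
        self._report(self._title, self._series, mean, self._iteration)
        self._iteration += 1
        self._buffer.clear()


if __name__ == "__main__":
    # Stand-in for a real logger so the sketch runs without ClearML installed.
    # With ClearML, the callable could instead be something like:
    #   lambda t, s, v, i: logger.report_scalar(title=t, series=s, value=v, iteration=i)
    sent = []
    reporter = AveragingReporter(lambda t, s, v, i: sent.append((t, s, v, i)),
                                 title="audio", series="amplitude", window=100)
    for sample in range(1000):     # 1000 raw samples
        reporter.add(float(sample))
    reporter.flush()
    print(len(sent))               # number of events actually reported
```

With a window of 100, a source producing 100 samples per second results in roughly one reported scalar per second instead of 100. Whether a mean, a maximum, or plain subsampling is appropriate depends on what the plot is meant to show.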

  				
Posted 3 years ago by ItchyJellyfish73

It's from the apiserver logs.

  				
Posted 3 years ago by ItchyJellyfish73

Thanks for the report, ItchyJellyfish73. As far as I know, such protections and QoS are supported in the ClearML paid version.

  				
Posted 3 years ago by SuccessfulKoala55

Hi ItchyJellyfish73
This seems aligned with the scenario you are describing: it seems the apiserver is overloaded with simultaneous connections.
Add an additional apiserver instance to the docker-compose file, with nginx as a load balancer:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L4
```
apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-stopped
  <...>
  networks:
    - backend
    - frontend
  ports:
    # host port 18008 -> port 8008 inside the container (host 8008 is left free for nginx)
    - "18008:8008"

apiserver_second:
  command:
    - apiserver
  # container names must be unique within the compose project
  container_name: clearml-apiserver-second
  image: allegroai/clearml:latest
  restart: unless-stopped
  <...>
  networks:
    - backend
    - frontend
  ports:
    - "18009:8008"

nginx-server:
  image: nginx:1.13
  ports:
    - "8008:8008"
  volumes:
    - './nginx.conf:/etc/nginx/nginx.conf'
  networks:
    - backend
  depends_on:
    - apiserver
    - apiserver_second
```

Then in the local nginx.conf:

```
events { worker_connections 1024; }

http {
    upstream api {
        # inside the Docker network, each apiserver is reached on its container port
        server apiserver:8008;
        server apiserver_second:8008;
    }
    server {
        listen 8008;
        location / {
            proxy_pass http://api;
        }
    }
}
```

Notice I might have made a typo above, but generally speaking it should work.

  				
Posted 3 years ago by AgitatedDove14

Well, it might simply be the Elasticsearch driver reusing connections. Regarding the apiserver, the CPU load is not indicative. How many requests per second is it handling, approximately?
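For context, the warning text matches the message urllib3 logs when a connection is returned to a pool that is already full. Assuming that is where it comes from here, the toy model below illustrates the mechanism. `TinyPool` and `simulate` are hypothetical names and this is not urllib3's code: a non-blocking pool opens extra connections when demand exceeds its size and discards the surplus on return instead of failing the request.

```python
import queue


class TinyPool:
    """Simplified model of a bounded, non-blocking connection pool."""

    def __init__(self, maxsize: int) -> None:
        self._idle = queue.LifoQueue(maxsize)  # holds idle connections for reuse
        self.created = 0                       # connections opened so far
        self.discarded = 0                     # connections dropped on return

    def acquire(self) -> int:
        try:
            return self._idle.get(block=False)  # reuse an idle connection
        except queue.Empty:
            self.created += 1                   # none idle: open a new one
            return self.created

    def release(self, conn: int) -> None:
        try:
            self._idle.put(conn, block=False)   # keep it for later reuse
        except queue.Full:
            self.discarded += 1                 # pool is full: drop the connection


def simulate(concurrent_requests: int, maxsize: int) -> TinyPool:
    """Acquire all connections at once, then return them all to the pool."""
    pool = TinyPool(maxsize)
    in_flight = [pool.acquire() for _ in range(concurrent_requests)]
    for conn in in_flight:
        pool.release(conn)
    return pool


if __name__ == "__main__":
    pool = simulate(concurrent_requests=15, maxsize=10)
    print(pool.created, pool.discarded)
```

In this model every request still gets a connection; the cost is that discarded connections have to be re-opened later. That would make the warning a sign of more concurrency than the pool size allows, and not by itself proof that requests are failing.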

  				
Posted 3 years ago by SuccessfulKoala55

AgitatedDove14, are you sure? The apiserver has low CPU load (<10%). Moreover, only requests related to ES are affected; other requests (like tasks.get_all or queues.get_all) take <10 ms.

  				
Posted 3 years ago by ItchyJellyfish73
