Thanks for the report, ItchyJellyfish73. As far as I know, such protections and QoS are supported in the ClearML paid version.
As I discovered, this was an ES overload caused by incorrect ClearML usage: report_scalar was called 100 times per second (the developer reported each sample from a wav file). This didn't affect the apiserver, because events were batched. There should probably be some protection against overload at the clearml package or apiserver level, as developers can do any crazy stuff in their code 🙃
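Until such protection exists, the reporting rate can be limited on the client side. Below is a minimal sketch, not part of ClearML: `ThrottledScalarReporter`, `min_interval_sec` and `flush_all` are hypothetical names. It assumes the wrapped callable accepts the same `title`, `series`, `value` and `iteration` arguments as `Logger.report_scalar`, averages everything that arrives within an interval, and forwards at most one point per interval.

```python
import time


class ThrottledScalarReporter:
    """Hypothetical helper: average incoming values and forward
    at most one point per (title, series) per time interval."""

    def __init__(self, report_fn, min_interval_sec=1.0, clock=time.monotonic):
        # report_fn is assumed to accept title, series, value, iteration
        # as keyword arguments (e.g. a ClearML logger's report_scalar).
        self._report_fn = report_fn
        self._min_interval = min_interval_sec
        self._clock = clock
        self._buffers = {}    # (title, series) -> values not yet reported
        self._last_sent = {}  # (title, series) -> time of the last report

    def report(self, title, series, value, iteration):
        key = (title, series)
        self._buffers.setdefault(key, []).append(value)
        now = self._clock()
        last = self._last_sent.get(key)
        # Forward immediately on the first call, then only once per interval.
        if last is None or now - last >= self._min_interval:
            self._flush(key, iteration, now)

    def _flush(self, key, iteration, now):
        values = self._buffers.pop(key, [])
        if not values:
            return
        title, series = key
        self._report_fn(
            title=title,
            series=series,
            value=sum(values) / len(values),  # mean of the buffered samples
            iteration=iteration,
        )
        self._last_sent[key] = now

    def flush_all(self, iteration):
        """Report whatever is still buffered, e.g. at the end of a file."""
        now = self._clock()
        for key in list(self._buffers):
            self._flush(key, iteration, now)
```

In a training script this could wrap the real logger, for example `ThrottledScalarReporter(Logger.current_logger().report_scalar)`, so that 100 samples per second collapse into roughly one reported point per second. Averaging is only one possible choice; keeping the max or the last value would work the same way.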
Hi ItchyJellyfish73
This seems aligned with the scenario you are describing: it seems the API server is overloaded with simultaneous connections.
Add an additional apiserver instance to the docker-compose and an nginx as load balancer:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L4
`
apiserver:
  command:
  - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-stopped
  <...>
  networks:
  - backend
  - frontend
  # no host port mapping: only nginx publishes 8008 and it reaches this container over the backend network
apiserver_second:
  command:
  - apiserver
  container_name: clearml-apiserver-second  # container names must be unique
  image: allegroai/clearml:latest
  restart: unless-stopped
  <...>
  networks:
  - backend
  - frontend
nginx-server:
  image: nginx:1.13
  ports:
  - "8008:8008"
  volumes:
  - './nginx.conf:/etc/nginx/nginx.conf'
  networks:
  - backend
  depends_on:
  - apiserver
  - apiserver_second
`
Then in the local `nginx.conf`:
`
events { worker_connections 1024; }
http {
  upstream api {
    # assumes both apiserver containers listen on their default internal port 8008
    server apiserver:8008;
    server apiserver_second:8008;
  }
  server {
    listen 8008;
    location / {
      proxy_pass http://api;
    }
  }
}
`
Notice I might have made a typo above, but generally speaking it should work.
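To illustrate what the `upstream` block does: with equally weighted servers and no other balancing directive, nginx distributes requests round-robin. The sketch below only simulates that rotation in plain Python; `round_robin` is a hypothetical helper and the backend names are simply the compose service names used above.

```python
from collections import Counter
from itertools import cycle, islice


def round_robin(backends, n_requests):
    """Return the backend picked for each request, cycling through the list in order."""
    return list(islice(cycle(backends), n_requests))


# Service names taken from the compose snippet; used here as plain labels.
backends = ["apiserver", "apiserver_second"]
assignments = round_robin(backends, 6)

# Each backend ends up with half of the requests.
print(Counter(assignments))
```

This is only a model of the default policy; real nginx also accounts for server weights and failed backends.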
Well, it might simply be the Elasticsearch driver reusing connections. Regarding the apiserver, the CPU load is not indicative. How many requests per second, approximately?
Hmm, are you getting the warning on the client side, or in the clearml-server?
Seems the apiserver is out of connections, this is odd...
SuccessfulKoala55, do you have an idea?
AgitatedDove14, are you sure? The API server has low CPU load (< 10%). Moreover, only requests related to ES are affected; other requests (like tasks.get_all or queues.get_all) take < 10 ms.