Hi DepressedChimpanzee34, took me a while but I think there is a solution:
In your docker-compose file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
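For context, this goes under the apiserver service section of docker-compose.yml (service name as in the stock file; exact indentation has to match your file), roughly:
  apiserver:
    ...
    entrypoint: /bin/bash
    command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
-w 4 is the gunicorn worker (process) count and -t 600 is the worker timeout in seconds, so -w is the knob for the number of handler processes.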
we have 8 cores and 16 GB RAM. The API server uses 1 core at 100% and everything else seems to be at low utilization. It is a standard installation. How can we change the number of internal API server handler processes?
How can we increase the number of API server processes?
DepressedChimpanzee34
I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?
On "regular" load there is no need for multiple processes, and the memory consumption might be more important than reply lag (at least before you start to scale)
DisturbedWalrus17
By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
Can you try with even more processes?
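e.g. in the same command override as above, just raising the gunicorn worker count (8 here is only an illustration, the right number depends on cores and free RAM):
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 8 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"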
after poking the setup in multiple ways we came to the conclusion that the API server is being clogged by calls from multiple HPOptimizers, and since it utilizes a single core it seems like we are not able to scale it up properly... any ideas?
we see this:
$ ps ax | grep python
10589 ? S 0:05 python3 fileserver.py
10808 ? Sl 18:07 python3 -m apiserver.server
30047 pts/0 S+ 0:00 grep --color=auto python
Would it help with any further diagnostics if I uploaded the clearml-* (e.g. apiserver or mongo) logs? SuccessfulKoala55 AgitatedDove14
Hi DepressedChimpanzee34
I think the main issue here is the slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?
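For example, the output of:
free -h
captured at peak time would show how much headroom is actually left.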
SuccessfulKoala55 can you refer me to how to increase the number of API server processes?
If you'd like, you can DM them
Thanks. I've sent them to you via DM.
What's the size of the mongo DB?
/opt/clearml/data/mongo/* has about 930M (if that's the right way of checking the size)
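That's a reasonable way to check; du reports the on-disk size, e.g. (assuming the default /opt/clearml/data mount path):
sudo du -sh /opt/clearml/data/mongo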
DepressedChimpanzee34 What are the CPU trends over time? any process taking up lots of CPU?
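For example, a snapshot from:
docker stats --no-stream
(or top sorted by CPU) taken while the UI is lagging would show whether a single container or process is pegged.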
Hi AgitatedDove14 and SuccessfulKoala55 I just had a look at the machine stats. Max CPU usage is ~30% (across all 4 cores). Average is more like 10% over a day or so. By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
Hi DepressedChimpanzee34 ,
This is mainly a matter of scale - do you have a standard ClearML Server installation? What's your server spec? This is probably related to the number of internal API server handler processes as well as ES capacity/cpu/allocated memory
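On the ES side, the allocated heap is typically controlled through the ES_JAVA_OPTS environment variable on the elasticsearch service in docker-compose.yml, roughly like this (values here are only an illustration, check what your file currently sets):
  elasticsearch:
    environment:
      - ES_JAVA_OPTS=-Xms4g -Xmx4g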
AgitatedDove14, seems to work significantly better! thanks!
if we look at the host machine we can see a single python process that is actually busy
Only one?! Can you see the other python processes?
AgitatedDove14 it ended up not solving our problem. It was a false observation. There is some bottleneck that kills the UI responsiveness that we can't identify.
AgitatedDove14 thanks, at peak usage we have 6-8 GB of free RAM
For example, opening a project or experiment page might take half a minute.
This implies mongodb performance issue
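If you want to check that, assuming the default clearml-mongo container name and that the mongo shell is available in the image, something like:
docker exec clearml-mongo mongo --eval "db.currentOp()"
run while a project page is loading lists in-progress operations; anything running for many seconds points at slow queries. Watching mongo's CPU in docker stats at the same time also helps.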
What's the size of the mongo DB?
AgitatedDove14 if we look at the host machine we can see a single python process that is actually busy
Hmm we might need more detailed logs ...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be mongo or elastic?
AFAIK, queued experiments have no impact on the system load
The API server by default spins up multiple processes (they all might be busy at the time with a huge flood of requests, but this is still multi-process). Let me check if there is an easy way to set more processes
what we observe is just general UI un-responsiveness. For example, opening a project or experiment page might take half a minute.
I should add: it seems to get worse when more workers are registered and more experiments are queued
AgitatedDove14 I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?
Ok - I've now tried with 8 workers instead of 4 and it's the same. I should note that the apiserver container CPU usage is pretty low (~5-10%). Also memory-wise it looks pretty in-spec to me. Below is a typical docker stats output when the server is behaving pretty sluggish:
CONTAINER ID   NAME                 CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
5e9160ba93d7   clearml-webserver    0.00%   5.996MiB / 7.446GiB   0.08%    803kB / 777kB     0B / 0B     6
e1596def9c4b   clearml-apiserver    0.33%   429.7MiB / 7.446GiB   5.64%    50.9MB / 20.5MB   0B / 0B     82
7664869a2ab5   clearml-elastic      0.14%   3.524GiB / 7.446GiB   47.33%   2.21MB / 2.55MB   0B / 0B     85
67476e6b48d6   clearml-fileserver   0.01%   25.21MiB / 7.446GiB   0.33%    18.8kB / 0B       0B / 0B     2
a95f2a7f07e6   clearml-redis        0.06%   2.504MiB / 7.446GiB   0.03%    133kB / 51kB      0B / 0B     4
41d9155c7fa3   clearml-mongo        0.13%   997.1MiB / 7.446GiB   13.08%   4.62MB / 37.3MB   0B / 0B     55