I think the main issue here is the slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB of RAM, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?
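(A quick way to capture this at peak, assuming a standard Linux host running Docker, is to snapshot host memory alongside per-container usage:)

# Host-level memory; the "available" column is the realistic headroom
free -h

# One-shot (non-streaming) snapshot of per-container CPU and memory
docker stats --no-stream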
Ok - I've now tried with 8 workers instead of 4 and it's the same. I should note that the apiserver container's CPU usage is pretty low (~5-10%). Memory-wise it also looks in-spec to me. Below is a typical docker stats output from when the server is behaving sluggishly:
CONTAINER ID   NAME                 CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
5e9160ba93d7   clearml-webserver    0.00%   5.996MiB / 7.446GiB   0.08%    803kB / 777kB     0B / 0B     6
e1596def9c4b   clearml-apiserver    0.33%   429.7MiB / 7.446GiB   5.64%    50.9MB / 20.5MB   0B / 0B     82
7664869a2ab5   clearml-elastic      0.14%   3.524GiB / 7.446GiB   47.33%   2.21MB / 2.55MB   0B / 0B     85
67476e6b48d6   clearml-fileserver   0.01%   25.21MiB / 7.446GiB   0.33%    18.8kB / 0B       0B / 0B     2
a95f2a7f07e6   clearml-redis        0.06%   2.504MiB / 7.446GiB   0.03%    133kB / 51kB      0B / 0B     4
41d9155c7fa3   clearml-mongo        0.13%   997.1MiB / 7.446GiB   13.08%   4.62MB / 37.3MB   0B / 0B     55
Hi AgitatedDove14 and SuccessfulKoala55, I just had a look at the machine stats. Max CPU usage is ~30% (across all 4 cores); the average is more like 10% over a day or so. By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
I am actually curious now: why is the default like this? Maybe more people are facing similar bottlenecks?
On "regular" load there is no need for multiple processes, and the memory consumption might be more important than reply lag (at least before you start to scale)
By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot
Can you try with even more processes?
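(For illustration: "more processes" here means raising gunicorn's -w flag in the apiserver startup command quoted later in this thread. The gunicorn docs suggest roughly (2 x num_cores) + 1 workers as a starting point, i.e. 9 on this 4-core machine; the exact count is an assumption, not a confirmed fix:)

# Same startup command as in the docker-compose file, with more workers
# (9 = 2*4 cores + 1, the usual gunicorn sizing heuristic)
gunicorn -w 9 -t 600 --bind=0.0.0.0:8008 apiserver.server:app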
Hi DepressedChimpanzee34 ,
This is mainly a matter of scale - do you have a standard ClearML Server installation? What's your server spec? This is probably related to the number of internal API server handler processes, as well as to ES (Elasticsearch) capacity/CPU/allocated memory
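(On the ES side, one quick check - assuming the standard clearml-elastic container, where the heap is usually passed via ES_JAVA_OPTS - is to confirm how much memory the JVM was actually given:)

# Show the heap options Elasticsearch was started with, e.g. "-Xms2g -Xmx2g"
docker exec clearml-elastic env | grep ES_JAVA_OPTS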
After poking the setup in multiple ways, we came to the conclusion that the API server is being clogged by calls from multiple HPOptimizers, and it utilizes a single core, so it seems like we are not able to scale it up properly... any ideas?
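(One way to confirm the single-core suspicion, assuming pgrep/ps are available on the host, is to watch per-process CPU for the gunicorn workers during a busy period; if only one of them ever accumulates CPU, the load isn't being spread:)

# List the gunicorn processes with their per-process CPU usage (%cpu column);
# run this a few times while the optimizers are hammering the server
ps -o pid,comm,%cpu -p $(pgrep -d, -f "gunicorn.*apiserver")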
Hi DepressedChimpanzee34, took me a while but I think there is a solution:
In your docker-compose file, replace:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
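(The message is cut off before the replacement itself. Whatever the new command turns out to be, one way to verify that the extra gunicorn workers actually came up after restarting - assuming ps is installed inside the apiserver container - is:)

# Count the gunicorn processes inside the container (1 master + N workers);
# the [g] trick keeps grep from matching its own process
docker exec clearml-apiserver ps -ef | grep '[g]unicorn'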