Answered
We are facing performance issues with our self-hosted ClearML server. Looking at the CPU utilization \ memory \ networking we couldn't identify a bottleneck. We are at the moment using ~100 workers for some HPO, and the main performance issues we observe are

We are facing performance issues with our self-hosted ClearML server.
Looking at CPU utilization, memory, and networking, we couldn't identify a bottleneck.
We are currently using ~100 workers for some HPO, and the main performance issues we observe are:
- The UI is super slow: it takes a minute or two to open a project or logs.
- Available workers are not utilized even though there are many pending tasks in their queue.
Any ideas how to tackle it?
AgitatedDove14 , AnxiousSeal95

  
  
Posted 2 years ago

Answers 28


After poking at the setup in multiple ways, we came to the conclusion that the API server is being clogged by calls from multiple HPO optimizers, and it utilizes a single core, so it seems we are not able to scale it up properly... any ideas?

  
  
Posted 2 years ago

The API server by default spins up multiple processes (they might all be busy at the same time under a huge flood of requests, but it is still multi-process). Let me check if there is an easy way to set more processes.

  
  
Posted 2 years ago

we see this:
$ ps ax | grep python
10589 ? S 0:05 python3 fileserver.py
10808 ? Sl 18:07 python3 -m apiserver.server
30047 pts/0 S+ 0:00 grep --color=auto python

  
  
Posted 2 years ago

if we look at the host machine we can see a single python process that is actually busy

Only one?! can you see the other python processes ?

  
  
Posted 2 years ago

AgitatedDove14 if we look at the host machine we can see a single python process that is actually busy

  
  
Posted 2 years ago

SuccessfulKoala55 can you refer me to how to increase the number of API server processes?

  
  
Posted 2 years ago

AgitatedDove14 thanks, at peak usage we have 6-8 gb of free RAM

  
  
Posted 2 years ago

How can we increase the number of API server processes?

  
  
Posted 2 years ago

DepressedChimpanzee34

I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?

On "regular" load there is no need for multiple processes, and the memory consumption might be more important than reply lag (at least before you start to scale)
DisturbedWalrus17

By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot

Can you try with even more processes?

  
  
Posted 2 years ago

Hi DepressedChimpanzee34, it took me a while, but I think there is a solution.
In your docker-compose file, replace this line:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
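
As a rough starting point for the -w value in the gunicorn command above, the gunicorn documentation suggests (2 x number of cores) + 1 workers. Below is a minimal sketch of that rule of thumb. The function name suggested_workers and the optional cap are illustrative and not part of ClearML or gunicorn. Each extra worker also costs RAM, so treat the result as a first guess rather than a tuned value for this server:

```python
import os


def suggested_workers(cores=None, cap=None):
    """Gunicorn's rule of thumb: (2 x cores) + 1, optionally capped."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() can return None
    workers = 2 * cores + 1
    # The cap is a hypothetical knob for machines with limited free RAM.
    return min(workers, cap) if cap is not None else workers


# Print the suggestion for the machine this runs on.
print(suggested_workers())
```

Run it on the server host (not inside a CPU-limited container) so the core count reflects the machine the API server actually uses.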

  
  
Posted 2 years ago

Hi AgitatedDove14 and SuccessfulKoala55 I just had a look at the machine stats. Max CPU usage is ~30% (of all the 4 cores). Average is more like 10% over a day or so. By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot

  
  
Posted 2 years ago

AgitatedDove14 it ended up not solving our problem.. it was a false observation.. there is some bottleneck that kills the UI responsiveness that we can't identify

  
  
Posted 2 years ago

AgitatedDove14 I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?

  
  
Posted 2 years ago

DepressedChimpanzee34 What are the CPU trends over time? any process taking up lots of CPU?

  
  
Posted 2 years ago

AgitatedDove14, it seems to work significantly better! Thanks!

  
  
Posted 2 years ago

Ok, I've now tried with 8 workers instead of 4 and it's the same. I should note that the apiserver container CPU usage is pretty low (~5-10%). Memory-wise it also looks well within spec to me. Below is a typical docker stats output when the server is behaving pretty sluggishly:
CONTAINER ID   NAME                 CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
5e9160ba93d7   clearml-webserver    0.00%   5.996MiB / 7.446GiB   0.08%    803kB / 777kB     0B / 0B     6
e1596def9c4b   clearml-apiserver    0.33%   429.7MiB / 7.446GiB   5.64%    50.9MB / 20.5MB   0B / 0B     82
7664869a2ab5   clearml-elastic      0.14%   3.524GiB / 7.446GiB   47.33%   2.21MB / 2.55MB   0B / 0B     85
67476e6b48d6   clearml-fileserver   0.01%   25.21MiB / 7.446GiB   0.33%    18.8kB / 0B       0B / 0B     2
a95f2a7f07e6   clearml-redis        0.06%   2.504MiB / 7.446GiB   0.03%    133kB / 51kB      0B / 0B     4
41d9155c7fa3   clearml-mongo        0.13%   997.1MiB / 7.446GiB   13.08%   4.62MB / 37.3MB   0B / 0B     55
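
To compare snapshots like the one above over time, it can help to reduce each one to a ranked list. The sketch below assumes the stats were collected as tab-separated lines, for example with docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}". The helper names parse_percent and rank_by_memory are made up for illustration, and the sample values are copied from the output above:

```python
def parse_percent(value):
    """Turn a string such as '47.33%' into the float 47.33."""
    return float(value.strip().rstrip("%"))


def rank_by_memory(lines):
    """Parse 'name<TAB>cpu%<TAB>mem%' lines, sorted by memory share (highest first)."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, cpu, mem = line.split("\t")
        rows.append((name, parse_percent(cpu), parse_percent(mem)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Values copied from the docker stats output in this post.
sample = [
    "clearml-webserver\t0.00%\t0.08%",
    "clearml-apiserver\t0.33%\t5.64%",
    "clearml-elastic\t0.14%\t47.33%",
    "clearml-fileserver\t0.01%\t0.33%",
    "clearml-redis\t0.06%\t0.03%",
    "clearml-mongo\t0.13%\t13.08%",
]
for name, cpu, mem in rank_by_memory(sample):
    print(f"{name:20s} cpu={cpu:5.2f}%  mem={mem:5.2f}%")
```

In this snapshot Elasticsearch and MongoDB hold most of the memory while every container is nearly idle on CPU, which is consistent with the observation that the apiserver itself does not look saturated.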

  
  
Posted 2 years ago

what we observe is just general UI un-responsiveness. For example, opening a project or experiment page might take half a minute.

I should add: it seems to get worse when more workers are registered and more experiments are queued

  
  
Posted 2 years ago

Hmm we might need more detailed logs ...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be mongo or elastic.
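
One way to put numbers on "lag" is to time repeated requests against the API server while the workers are active, and again while they are idle. This is only a sketch: the URL http://localhost:8008/debug.ping is an assumption about a default installation (adjust host, port, and endpoint to your deployment), and the helper names are illustrative:

```python
import statistics
import time
import urllib.request


def time_calls(call, repeats=10):
    """Run `call` several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def ping_apiserver(url="http://localhost:8008/debug.ping"):
    """Fetch one response; the URL is an assumed default, not a verified one."""
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


# Usage (not executed here because it needs a running server):
# print(summarize(time_calls(ping_apiserver)))
```

If a trivial ping is fast but opening a project is slow, the delay is more likely in the database queries behind the heavier calls than in the number of apiserver processes.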

  
  
Posted 2 years ago

For example, opening a project or experiment page might take half a minute.

This suggests a MongoDB performance issue.
What's the size of the mongo DB?

  
  
Posted 2 years ago

What's the size of the mongo DB?

/opt/clearml/data/mongo/* has about 930M (if that's the right way of checking the size)
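
For reference, the on-disk figure can also be computed with a few lines of Python. This is a sketch, and directory_size_bytes is an illustrative helper name. It sums apparent file sizes, so the result can differ slightly from what du reports, and on-disk size is not the same as the logical data size that MongoDB's db.stats() reports:

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Usage, with the path mentioned in this post:
# print(directory_size_bytes("/opt/clearml/data/mongo") / 1024 ** 2, "MiB")
```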

  
  
Posted 2 years ago

AFAIK, queued experiments have no impact on the system load

  
  
Posted 2 years ago

Would it help with further diagnostics if I upload the clearml-* (e.g. apiserver or mongo) logs? SuccessfulKoala55 AgitatedDove14

  
  
Posted 2 years ago

If you'd like, you can DM them

Thanks. I've sent them to you via DM.

  
  
Posted 2 years ago

If you'd like, you can DM them 🙂

  
  
Posted 2 years ago

Couldn't hurt 🙂

  
  
Posted 2 years ago

Hi DepressedChimpanzee34
I think the main issue here is slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?

  
  
Posted 2 years ago

Hi DepressedChimpanzee34 ,
This is mainly a matter of scale - do you have a standard ClearML Server installation? What's your server spec? This is probably related to the number of internal API server handler processes as well as ES capacity/cpu/allocated memory

  
  
Posted 2 years ago

We have 8 cores and 16 GB RAM. The API server uses 1 core at 100% and everything else seems to be at low utilization. It is a standard installation. How can we change the number of internal API server handler processes?

  
  
Posted 2 years ago