We Are Facing Performance Issues With Our Self-Hosted ClearML Server. Looking At The CPU Utilization, Memory, And Networking, We Couldn't Identify A Bottleneck. We Are At The Moment Using ~100 Workers For Some HPO

Answered

We are facing performance issues with our self-hosted ClearML server.
Looking at CPU utilization, memory, and networking, we couldn't identify a bottleneck.
We are currently using ~100 workers for some HPO, and the main performance issues we observe are:
- The UI is super slow: it takes a minute or two to open a project or logs.
- Available workers are not utilized even though there are many pending tasks on their queue.
Any ideas how to tackle it?
AgitatedDove14, AnxiousSeal95

  				
Posted 3 years ago by DepressedChimpanzee34


Answers 28

Hi DepressedChimpanzee34, it took me a while, but I think there is a solution.
In your docker-compose file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
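
How many workers to pass to `-w` depends on the machine. A minimal sketch, assuming the starting-point rule of thumb from the Gunicorn documentation, (2 x cores) + 1. The helper names `suggested_workers` and `gunicorn_command` are hypothetical; the timeout, bind address, and app path are copied from the command above. On a server with limited RAM a lower number may be the better choice.

```python
import os


def suggested_workers(cores=None):
    """Starting-point worker count: (2 x cores) + 1, per the Gunicorn docs."""
    if cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    return 2 * cores + 1


def gunicorn_command(workers, timeout=600, bind="0.0.0.0:8008"):
    """Build the gunicorn invocation used in the compose override above."""
    return (
        f"gunicorn -w {workers} -t {timeout} "
        f"--bind={bind} apiserver.server:app"
    )


print(gunicorn_command(suggested_workers()))
```

Each extra worker is a separate process with its own memory footprint, so it is worth checking free RAM before raising the count.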

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14, it ended up not solving our problem. It was a false observation. There is some bottleneck that kills the UI responsiveness, and we can't identify it.

  				
Posted 3 years ago by DepressedChimpanzee34

AgitatedDove14, this seems to work significantly better! Thanks!

  				
Posted 3 years ago by DepressedChimpanzee34

DepressedChimpanzee34
"I am actually curious now, why is the default like this? Maybe more people are facing similar bottlenecks?"

Under "regular" load there is no need for multiple processes, and memory consumption might be more important than reply lag (at least before you start to scale).

DisturbedWalrus17
"By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot"

Can you try with even more processes?

  				
Posted 3 years ago by AgitatedDove14

we see this:
$ ps ax | grep python
10589 ? S 0:05 python3 fileserver.py
10808 ? Sl 18:07 python3 -m apiserver.server
30047 pts/0 S+ 0:00 grep --color=auto python
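
To make the "single process" observation easy to repeat, here is a small sketch that counts matching rows in `ps ax` output. The function name `count_processes` is hypothetical, and the sample text is the output pasted above. With a multi-worker Gunicorn setup one would expect several matching rows (a master plus its workers) rather than one.

```python
def count_processes(ps_output, needle):
    """Count `ps ax` rows whose command contains `needle`, skipping grep itself."""
    count = 0
    for line in ps_output.splitlines():
        # `ps ax` columns: PID, TTY, STAT, TIME, then the full command line.
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        command = parts[4]
        if needle in command and not command.startswith("grep"):
            count += 1
    return count


# Output copied from the post above.
SAMPLE = """\
10589 ? S 0:05 python3 fileserver.py
10808 ? Sl 18:07 python3 -m apiserver.server
30047 pts/0 S+ 0:00 grep --color=auto python
"""

print(count_processes(SAMPLE, "apiserver"))  # prints 1
```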

  				
Posted 3 years ago by DepressedChimpanzee34

Would it help with further diagnostics if I uploaded the clearml-* logs (e.g. apiserver or mongo)? SuccessfulKoala55 AgitatedDove14

  				
Posted 3 years ago by DisturbedWalrus17

Hi DepressedChimpanzee34,
This is mainly a matter of scale - do you have a standard ClearML Server installation? What's your server spec? This is probably related to the number of internal API server handler processes as well as ES capacity/cpu/allocated memory

  				
Posted 3 years ago by SuccessfulKoala55

We have 8 cores and 16 GB of RAM. The API server uses 1 core at 100% and everything else seems to be at low utilization. It is a standard installation. How can we change the number of internal API server handler processes?

  				
Posted 3 years ago by DepressedChimpanzee34

How can we increase the number of API server processes?

  				
Posted 3 years ago by DepressedChimpanzee34

"if we look at the host machine we can see a single python process that is actually busy"

Only one?! Can you see the other Python processes?

  				
Posted 3 years ago by AgitatedDove14

If you'd like, you can DM them 🙂

  				
Posted 3 years ago by SuccessfulKoala55

After poking at the setup in multiple ways, we came to the conclusion that the API server is being clogged by calls from multiple hyperparameter optimizers. It utilizes a single core, so it seems we are not able to scale it up properly. Any ideas?

  				
Posted 3 years ago by DepressedChimpanzee34

What we observe is just general UI unresponsiveness. For example, opening a project or experiment page might take half a minute.

I should add: it seems to get worse when more workers are registered and more experiments are queued

  				
Posted 3 years ago by DisturbedWalrus17

The API server by default spins up multiple processes (they might all be busy all the time under a huge flood of requests, but this is still multi-process). Let me check if there is an easy way to set more processes.

  				
Posted 3 years ago by AgitatedDove14

OK, I've now tried with 8 workers instead of 4 and it's the same. I should note that the apiserver container CPU usage is pretty low (~5-10%). Memory-wise it also looks pretty in-spec to me. Below is a typical docker stats output when the server is behaving pretty sluggishly:
CONTAINER ID   NAME                 CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
5e9160ba93d7   clearml-webserver    0.00%   5.996MiB / 7.446GiB   0.08%    803kB / 777kB     0B / 0B     6
e1596def9c4b   clearml-apiserver    0.33%   429.7MiB / 7.446GiB   5.64%    50.9MB / 20.5MB   0B / 0B     82
7664869a2ab5   clearml-elastic      0.14%   3.524GiB / 7.446GiB   47.33%   2.21MB / 2.55MB   0B / 0B     85
67476e6b48d6   clearml-fileserver   0.01%   25.21MiB / 7.446GiB   0.33%    18.8kB / 0B       0B / 0B     2
a95f2a7f07e6   clearml-redis        0.06%   2.504MiB / 7.446GiB   0.03%    133kB / 51kB      0B / 0B     4
41d9155c7fa3   clearml-mongo        0.13%   997.1MiB / 7.446GiB   13.08%   4.62MB / 37.3MB   0B / 0B     55
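
For comparing snapshots like this over time, a small sketch that ranks containers by a percentage column. The helper names `parse_percent` and `busiest` are hypothetical, and the rows are typed in by hand from the output above rather than read from Docker.

```python
def parse_percent(text):
    """Convert a string such as '47.33%' to the float 47.33."""
    return float(text.strip().rstrip("%"))


def busiest(rows, key):
    """Return container names sorted by the given percentage column, highest first."""
    ranked = sorted(rows, key=lambda row: parse_percent(row[key]), reverse=True)
    return [row["name"] for row in ranked]


# CPU % and MEM % values copied from the docker stats output above.
ROWS = [
    {"name": "clearml-webserver", "cpu": "0.00%", "mem": "0.08%"},
    {"name": "clearml-apiserver", "cpu": "0.33%", "mem": "5.64%"},
    {"name": "clearml-elastic", "cpu": "0.14%", "mem": "47.33%"},
    {"name": "clearml-fileserver", "cpu": "0.01%", "mem": "0.33%"},
    {"name": "clearml-redis", "cpu": "0.06%", "mem": "0.03%"},
    {"name": "clearml-mongo", "cpu": "0.13%", "mem": "13.08%"},
]

print(busiest(ROWS, "mem")[:2])
print(busiest(ROWS, "cpu")[0])
```

In this snapshot no container is anywhere near a full core, which is consistent with the lag not being CPU-bound at that moment. A single sample can easily miss short spikes, though.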

  				
Posted 3 years ago by DisturbedWalrus17

SuccessfulKoala55 can you refer me to how to increase the number of API server processes?

  				
Posted 3 years ago by DepressedChimpanzee34

"If you'd like, you can DM them"

Thanks. I've sent them to you via DM.

  				
Posted 3 years ago by DisturbedWalrus17

Hi AgitatedDove14 and SuccessfulKoala55 I just had a look at the machine stats. Max CPU usage is ~30% (of all the 4 cores). Average is more like 10% over a day or so. By spawning multiple processes for the API server, it looks like we utilise the CPU more now but the UI and API calls are still lagging a lot

  				
Posted 3 years ago by DisturbedWalrus17

Couldn't hurt 🙂

  				
Posted 3 years ago by SuccessfulKoala55

Hi DepressedChimpanzee34
I think the main issue here is slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16 GB, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 thanks, at peak usage we have 6-8 GB of free RAM.

  				
Posted 3 years ago by DepressedChimpanzee34

"For example, opening a project or experiment page might take half a minute."

This implies a MongoDB performance issue.
What's the size of the mongo DB?

  				
Posted 3 years ago by AgitatedDove14

Hmm, we might need more detailed logs...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be Mongo or Elastic.
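
One way to put a number on "lag" is to time repeated requests against the API server. A minimal sketch: `measure_latency` and `ping_apiserver` are hypothetical helper names, and the URL `http://localhost:8008/debug.ping` is an assumption about the deployment (port 8008 comes from the gunicorn command earlier in the thread), so substitute whatever endpoint your server actually exposes.

```python
import statistics
import time
import urllib.request


def measure_latency(call, repeats=20):
    """Time `call()` `repeats` times; return (median, worst) latency in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)


def ping_apiserver(url="http://localhost:8008/debug.ping"):
    """Issue one GET against the API server (the URL is an assumption, see above)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


# Usage, on a machine that can reach the server:
#   median, worst = measure_latency(ping_apiserver)
#   print(f"median={median:.3f}s worst={worst:.3f}s")
```

If a trivial request like this is fast while project pages are slow, that would point away from the API server processes themselves and toward the databases behind them.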

  				
Posted 3 years ago by AgitatedDove14

"What's the size of the mongo DB?"

/opt/clearml/data/mongo/* has about 930M (if that's the right way of checking the size)
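
On the "right way of checking" question: summing the files on disk gives the storage footprint, which is one reasonable measure. A minimal sketch of that check; the function name `dir_size_bytes` is hypothetical and the path is the one quoted above. MongoDB's own `db.stats()` shell helper reports data and index sizes separately, which may be more informative than the on-disk total.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


# Usage, on the server host:
#   size = dir_size_bytes("/opt/clearml/data/mongo")
#   print(f"{size / 1024 ** 2:.0f} MiB")
```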

  				
Posted 3 years ago by DisturbedWalrus17

AFAIK, queued experiments have no impact on the system load

  				
Posted 3 years ago by SuccessfulKoala55

AgitatedDove14 I am actually curious now, why is the default like this? maybe more people are facing similar bottlenecks?

  				
Posted 3 years ago by DepressedChimpanzee34

AgitatedDove14 if we look at the host machine we can see a single python process that is actually busy

  				
Posted 3 years ago by DepressedChimpanzee34

DepressedChimpanzee34 What are the CPU trends over time? any process taking up lots of CPU?

  				
Posted 3 years ago by SuccessfulKoala55
