I have a problem that might not directly be ClearML related, but maybe someone here has an idea: I run a clearml-server on a machine with 128GB RAM, 32 cores and 2 GPUs. On the same machine I run 2 clearml-agent each with access to 1 GPU, 12 cores, an 48G

Answered

I have a problem that might not directly be ClearML related, but maybe someone here has an idea:
I run a clearml-server on a machine with 128GB RAM, 32 cores and 2 GPUs.
On the same machine I run two clearml-agents in docker mode, each with access to 1 GPU, 12 cores, and 48GB RAM. I use the following option in my clearml.conf to limit resources:

extra_docker_arguments: [ "--memory-swap=48g", "--memory=48g", "--shm-size=48g", "--cpus=24" ] (cpus = 24 because of SMT).
So 96/128GB RAM for clearml-agents and the rest for clearml-server.
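The budget above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper names parse_docker_size and flag_value are made up for illustration, and it assumes each agent runs exactly one task container at a time. Note that when --memory-swap equals --memory, Docker gives the container no swap, so 48g is a hard ceiling per container.

```python
import re

# Docker size suffixes: b, k, m, g (binary multiples).
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_docker_size(value: str) -> int:
    """Convert a Docker-style size such as '48g' into bytes."""
    match = re.fullmatch(r"(\d+)([bkmg])", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def flag_value(args, name):
    """Return the value of '--name=value' from a list of docker arguments."""
    prefix = f"--{name}="
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    raise KeyError(name)


# Values taken from the post; NUM_AGENTS assumes one container per agent.
extra_docker_arguments = ["--memory-swap=48g", "--memory=48g", "--shm-size=48g", "--cpus=24"]
HOST_RAM = parse_docker_size("128g")
NUM_AGENTS = 2

per_agent = parse_docker_size(flag_value(extra_docker_arguments, "memory"))
agents_total = NUM_AGENTS * per_agent
server_headroom = HOST_RAM - agents_total

gib = 1024**3
print(f"agents: {agents_total // gib} GiB, left for server and OS: {server_headroom // gib} GiB")
```

The remaining 32 GiB has to cover every clearml-server container plus the host OS and anything else running on the machine (for example nginx, minio and a docker registry).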

However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.

The weird thing is that 48GB per clearml-agent should be more than enough, and on other machines that only host two clearml-agents everything runs fine indefinitely. On the machine with the server, every time I monitor memory, only ~50GB of the 128GB is in use.

Any idea what I am doing wrong? Btw: this problem is hard to reproduce 😕
The only direction I see so far is something with shared memory, since some of my experiments may use a lot of shared memory, but I think this should be included in the limits?

  				
Posted 2 years ago by ReassuredTiger98


Answers 14

I usually also experience no problems with restarting the clearml-server. It seems like it has to do with the OOM (or whatever issue I have).

  				
Posted 2 years ago by ReassuredTiger98

CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container will hang? Is this possible?
SuccessfulKoala55 I did not observe elastic to use much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage?
ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
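For reference, -Xms and -Xmx set the initial and maximum JVM heap size, so this line caps the heap at 2g. The heap is not the whole footprint of the process, though, since the JVM also uses memory outside the heap. A small sketch for reading the heap bounds out of such a string (jvm_heap_bounds is a hypothetical helper, not part of ClearML or Elasticsearch):

```python
import re


def jvm_heap_bounds(java_opts: str):
    """Return (initial, maximum) heap sizes from a JVM options string.

    A missing flag is reported as None.
    """
    def find(flag):
        match = re.search(rf"(?:^|\s)-{flag}(\d+[kKmMgG]?)(?=\s|$)", java_opts)
        return match.group(1) if match else None

    return find("Xms"), find("Xmx")


# The value from the docker-compose line quoted above.
es_java_opts = "-Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true"
initial_heap, max_heap = jvm_heap_bounds(es_java_opts)
print(initial_heap, max_heap)
```

Comparing this against the memory the container actually uses would show how much Elasticsearch takes beyond its heap.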

  				
Posted 2 years ago by ReassuredTiger98

then maybe a process inside the container gets killed and the container will hang? Is this possible?

I'm not sure. Usually, if Elastic is unresponsive or not working properly, the API server will have trouble starting up or working and will print out errors.

  				
Posted 2 years ago by CostlyOstrich36

It shows some logs, but nothing of relevance I think. Only infos and warnings about deprecated stuff that is still used ;D ...

  				
Posted 2 years ago by ReassuredTiger98

Could be a clean log after the restart. Unfortunately, I restarted the server right away 😞 I will post the appropriate logs if it happens again.

  				
Posted 2 years ago by ReassuredTiger98

128GB RAM, 32 cores and 2 GPUs.

WOW 😮 I'm so jealous

However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.

Do you get any errors on how/why the container exits? Which container is it?

  				
Posted 2 years ago by CostlyOstrich36

What happens if you look at the elastic container logs directly? I think it's something along the lines of sudo docker logs clearml-elastic --follow. Don't hold me to the exact syntax or naming though 😛

  				
Posted 2 years ago by CostlyOstrich36

Are you sure this is not a clean log following a restart? If you run sudo docker ps, do all containers have roughly the same uptime?

  				
Posted 2 years ago by SuccessfulKoala55

SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so that the server only slows down but does not kill anything. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
What is interesting though is that docker only showed my nginx, minio and docker-registry to have exited, while all the clearml containers were still running. I restarted everything and now previously running experiments are shown as aborted. I checked the clearml-agents and I can clearly see that the tasks are still running (high GPU/CPU load and processes still running). But then, after the clearml-agents reconnect to the server, the tasks stop (no more processes running). Super weird.
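One way to look into the exited containers is the JSON that docker inspect prints for them, which has State.ExitCode and State.OOMKilled fields. The sketch below summarises that output. The summarize_exits helper and the sample data are made up for illustration and are not taken from this machine. Exit code 137 means the process was killed with SIGKILL (128 + 9), and the OOMKilled flag may not be set in every case, so a False value does not rule out an out-of-memory kill.

```python
import json


def summarize_exits(inspect_json: str):
    """Summarise exit state from the JSON array printed by `docker inspect`."""
    rows = []
    for container in json.loads(inspect_json):
        state = container["State"]
        rows.append({
            "name": container["Name"].lstrip("/"),  # docker prefixes names with "/"
            "status": state["Status"],
            "exit_code": state["ExitCode"],
            "oom_killed": state["OOMKilled"],
        })
    return rows


# Hypothetical sample in the shape of `docker inspect` output.
sample = json.dumps([
    {"Name": "/nginx", "State": {"Status": "exited", "ExitCode": 137, "OOMKilled": False}},
    {"Name": "/minio", "State": {"Status": "exited", "ExitCode": 137, "OOMKilled": True}},
])

for row in summarize_exits(sample):
    killed = row["exit_code"] == 137
    print(f'{row["name"]}: status={row["status"]} sigkill={killed} oom_killed={row["oom_killed"]}')
```

On a real host the input would come from running docker inspect on the exited containers and passing its output to summarize_exits.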

  				
Posted 2 years ago by ReassuredTiger98

ReassuredTiger98 I have a feeling this might be related to Elasticsearch, which usually preallocates a lot of RAM (half of the system's RAM, if I'm not mistaken?) and does not play nice with others. Do you have any specific memory settings for ES in the clearml docker compose?

  				
Posted 2 years ago by SuccessfulKoala55

When using agents on different machines this never happens (and we do server restarts and downtimes regularly, almost daily, with multiple experiments running).

  				
Posted 2 years ago by SuccessfulKoala55

This is very strange, as we have lots of experience with the server going down while experiments wait and keep running. Are you sure the agents keep going as usual?

  				
Posted 2 years ago by SuccessfulKoala55

I see, I just checked the logs and they show:
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused
[2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to.

  				
Posted 2 years ago by ReassuredTiger98

Yes, I am also talking about agents on different machines. I had two agents on the server machine, which also seem to have been killed. The ones on different machines kept working until 1 or 2 minutes after the clearml-server restarted.

  				
Posted 2 years ago by ReassuredTiger98
