Answered
I Have A Problem That Might Not Directly Be Clearml Related, But Maybe Someone Here Has An Idea: I Run A Clearml-Server On A Machine With 128Gb Ram, 32 Cores And 2 Gpus. On The Same Machine I Run 2 Clearml-Agent Each With Access To 1 Gpu, 12 Cores, An 48G

I have a problem that might not directly be ClearML related, but maybe someone here has an idea:
I run a clearml-server on a machine with 128GB RAM, 32 cores and 2 GPUs.
On the same machine I run 2 clearml-agents in docker mode, each with access to 1 GPU, 12 cores, and 48GB RAM. I use the following option in my clearml.conf to limit resources:

extra_docker_arguments: [ "--memory-swap=48g", "--memory=48g", "--shm-size=48g", "--cpus=24" ] (cpus = 24 because of SMT).
So 96/128GB RAM for clearml-agents and the rest for clearml-server.
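The budget described above can be checked with a few lines of arithmetic. This is only a sketch: the helper names (parse_size, parse_docker_args) are made up for illustration, and the count of 64 hardware threads assumes 2 SMT threads per core, which the post implies but does not state outright.

```python
import re

# Multipliers for Docker-style size suffixes (binary units).
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a Docker-style size such as '48g' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([bkmg])", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_docker_args(args):
    """Turn ['--memory=48g', ...] into {'memory': '48g', ...} (hypothetical helper)."""
    return dict(arg.lstrip("-").split("=", 1) for arg in args)


# The option list quoted in the post.
extra_docker_arguments = ["--memory-swap=48g", "--memory=48g", "--shm-size=48g", "--cpus=24"]
limits = parse_docker_args(extra_docker_arguments)

HOST_RAM = parse_size("128g")
HOST_THREADS = 32 * 2  # assumption: 2 hardware threads per physical core
NUM_AGENTS = 2

agent_ram = NUM_AGENTS * parse_size(limits["memory"])
server_ram = HOST_RAM - agent_ram
agent_cpus = NUM_AGENTS * float(limits["cpus"])
# --memory-swap is the combined memory + swap limit, so an equal value means no swap.
swap_per_container = parse_size(limits["memory-swap"]) - parse_size(limits["memory"])

print(f"RAM reserved for agents: {agent_ram // 1024**3} GiB")
print(f"RAM left for the server: {server_ram // 1024**3} GiB")
print(f"CPU threads for agents:  {agent_cpus:.0f} of {HOST_THREADS}")
print(f"Swap per container:      {swap_per_container} bytes")
```

With these settings each container has no swap headroom, so a task that reaches its 48g limit is killed rather than slowed down.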

However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.

The weird thing is that 48GB per clearml-agent should be more than enough, and on other machines that only host two clearml-agents everything runs fine indefinitely. On the machine with the server, every time I monitor memory only ~50GB/128GB are used at all.

Any idea what I am doing wrong? Btw: this is a hard-to-reproduce problem 😕
The only direction I see so far is something with shared memory, since some of my experiments may use a lot of shared memory, but I think this should be included in the limits?

  
  
Posted 2 years ago

Answers 14


I see, I just checked the logs and it shows
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused
[2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to.
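"Connection refused" (errno 111 on Linux) means nothing was listening on the target port at that moment. A minimal sketch of a check that can be run from the apiserver's side is shown here. The function name can_connect is hypothetical, and the host and port (127.0.0.1:9200, the Elasticsearch default HTTP port) are assumptions that may not match a given docker-compose network.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and name-resolution failures.
        return False


# Assumed address; adjust to wherever Elasticsearch is reachable in your setup.
print("elasticsearch reachable:", can_connect("127.0.0.1", 9200))
```

If this returns False while the container is listed as running, the Elasticsearch process inside it has probably stopped or is still starting up.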

  
  
Posted 2 years ago

SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with way larger SWAP, so the server only slows down, but does not kill anything. Unfortunately, kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
What is interesting though is that docker only showed my nginx, minio and docker-registry to have exited, while all the clearml containers were still running. I restarted everything and now previously running experiments are shown as aborted. I checked the clearml-agents and I can clearly see that the tasks are still running (high GPU/CPU load and processes still running). But then after the clearml-agents reconnect to the server, the tasks stop (no more processes running). Super weird.
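When the kernel OOM killer does act, it normally leaves a line in the kernel log (for example in the output of dmesg or journalctl -k). The sketch below scans such text for those lines. The function name find_oom_kills is hypothetical, and the two sample lines are illustrative, not taken from the machine in this thread; the exact wording also varies between kernel versions.

```python
import re

# Matches both host-wide and cgroup (container limit) OOM kills.
OOM_PATTERN = re.compile(
    r"(Memory cgroup out of memory|Out of memory): Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def find_oom_kills(kernel_log):
    """Return one dict per OOM-kill line found in the given kernel log text."""
    kills = []
    for line in kernel_log.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            scope = "cgroup" if match.group(1).startswith("Memory cgroup") else "host"
            kills.append({"scope": scope, "pid": int(match.group(2)), "process": match.group(3)})
    return kills


# Illustrative sample only; real input would come from `dmesg` or `journalctl -k`.
sample_log = """\
[12345.678901] Out of memory: Killed process 4321 (java) total-vm:8123456kB, anon-rss:2097152kB
[12346.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[12350.111111] Memory cgroup out of memory: Killed process 9876 (python) total-vm:1234567kB
"""

for kill in find_oom_kills(sample_log):
    print(kill)
```

A "cgroup" hit points at a container reaching its own --memory limit, while a "host" hit means the whole machine ran out of memory.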

  
  
Posted 2 years ago

Are you sure this is not a clean log following a restart? If you do sudo docker ps , do all containers have roughly the same up time?
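To compare uptimes at a glance, the output of sudo docker ps -a --format "{{.Names}}\t{{.Status}}" can be parsed with a small script. This is a sketch only: parse_docker_ps is a hypothetical helper, and the sample output below is invented for illustration, including the container names.

```python
import re


def parse_docker_ps(output):
    """Parse tab-separated 'name<TAB>status' lines into a list of dicts."""
    rows = []
    for line in output.strip().splitlines():
        name, status = line.split("\t", 1)
        exited = re.match(r"Exited \((\d+)\)", status)
        rows.append({
            "name": name,
            "running": status.startswith("Up"),
            "exit_code": int(exited.group(1)) if exited else None,
            "status": status,
        })
    return rows


# Invented sample; real input comes from the docker ps command above.
sample_output = (
    "clearml-apiserver\tUp 3 days\n"
    "clearml-elastic\tUp 12 minutes\n"
    "nginx\tExited (137) 2 hours ago\n"
)

for row in parse_docker_ps(sample_output):
    print(f"{row['name']:<20} {row['status']}")
```

A container whose uptime is much shorter than the others was probably restarted on its own, and an exit code of 137 (128 + 9) means the process received SIGKILL, which is what an OOM kill looks like.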

  
  
Posted 2 years ago

Could be a clean log after restart. Unfortunately, I restarted the server right away 😞 I will post here if it happens again, with the appropriate logs.

  
  
Posted 2 years ago

This is very strange, as we have lots of experience with the server going down while experiments wait and keep running. Are you sure the agents keep going as usual?

  
  
Posted 2 years ago

CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
SuccessfulKoala55 I did not observe elastic to use much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage?
ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
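For reference, -Xms and -Xmx set the initial and maximum size of the JVM heap. They cap only the heap, so the Elasticsearch process as a whole can use more than that (off-heap buffers, thread stacks, and so on). A minimal sketch for reading the values out of such a string follows; heap_settings is a hypothetical helper name.

```python
import re


def heap_settings(java_opts):
    """Return the -Xms/-Xmx heap sizes found in java_opts, in bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    found = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kmgKMG]?)", java_opts):
        # A missing suffix means the number is already in bytes.
        found[flag] = int(number) * units.get(unit.lower(), 1)
    return found


# The value quoted from the docker-compose file above.
opts = "-Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true"
heap = heap_settings(opts)

print(f"initial heap: {heap['Xms'] // 1024**2} MiB")
print(f"maximum heap: {heap['Xmx'] // 1024**2} MiB")
```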

  
  
Posted 2 years ago

It shows some logs, but nothing of relevance I think. Only infos and warnings about deprecated stuff that is still used ;D ...

  
  
Posted 2 years ago

then maybe a process inside the container gets killed and the container will hang? Is this possible?

I'm not sure. Usually if Elastic is unresponsive or not working properly, the API server will have issues starting up or working and will print out errors.

  
  
Posted 2 years ago

What happens if you look at the elastic container logs directly? I think it's something along the lines of sudo docker logs clearml-elastic --follow . Don't hold me to the exact syntax or naming tho 😛

  
  
Posted 2 years ago

ReassuredTiger98 I have a feeling this might be related to Elasticsearch, which usually preallocates a lot of RAM (half of the system's RAM, if I'm not mistaken?) and does not play nice with others.... Do you have any specific memory settings for ES in the clearml docker compose?

  
  
Posted 2 years ago

128GB RAM, 32 cores and 2 GPUs.

WOW 😮 I'm so jealous

However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.

Do you get any errors on how/why the container exits? Which container is it?
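One way to answer this is sudo docker inspect <container>, which prints a JSON array whose State object includes the exit code and an OOMKilled flag. The sketch below summarizes that JSON. The helper name exit_summary is hypothetical, and the sample document is trimmed and invented for illustration.

```python
import json


def exit_summary(inspect_json):
    """Summarize name, status, exit code and OOM flag from `docker inspect` output."""
    summaries = []
    for container in json.loads(inspect_json):
        state = container["State"]
        summaries.append({
            "name": container["Name"].lstrip("/"),  # docker prefixes names with '/'
            "status": state["Status"],
            "exit_code": state["ExitCode"],
            "oom_killed": state["OOMKilled"],
        })
    return summaries


# Trimmed, invented sample; real input comes from `docker inspect <container>`.
sample_inspect = """
[
  {
    "Name": "/clearml-elastic",
    "State": {"Status": "exited", "Running": false, "OOMKilled": true, "ExitCode": 137}
  }
]
"""

for summary in exit_summary(sample_inspect):
    print(summary)
```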

  
  
Posted 2 years ago

Yes, I am also talking about agents on different machines. I had two agents on the server machine, which also seem to have been killed. The ones on different machines kept working until 1 or 2 minutes after the clearml-server restarted.

  
  
Posted 2 years ago

When using agents on different machines this never happens (and we do server restarts and downtimes regularly, with multiple experiments running, almost daily).

  
  
Posted 2 years ago

I usually also experience no problems with restarting the clearml-server. It seems like it has to do with the OOM (or whatever issue I have).

  
  
Posted 2 years ago