CostlyOstrich36 Actually, no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
SuccessfulKoala55 I did not observe elastic to use much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage?
ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
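As a small aside on what that line controls: `-Xms` and `-Xmx` set the initial and maximum JVM heap size, so they bound the heap only. The Elasticsearch process can still use memory outside the heap. A minimal sketch that reads the two values out of the options string (the helper name `parse_heap_flags` is made up for illustration):

```python
import re


def parse_heap_flags(java_opts: str) -> dict:
    """Return the -Xms/-Xmx values found in a JVM options string, in bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    sizes = {}
    # Match e.g. "-Xms2g" or "-Xmx512m"; a missing unit suffix means plain bytes.
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)", java_opts):
        sizes[flag] = int(number) * units.get(unit.lower(), 1)
    return sizes


# The options string quoted in the thread.
opts = "-Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true"
heap = parse_heap_flags(opts)
print({flag: size // 1024**3 for flag, size in heap.items()})  # sizes in GiB
```

With both values at 2g the heap is fixed at 2 GiB, which fits the observation that Elasticsearch did not appear to use much RAM right after starting.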
128GB RAM, 32 cores and 2 GPUs.
WOW 😮 I'm so jealous
However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.
Do you get any errors on how/why the container exits? Which container is it?
then maybe a process inside the container gets killed and the container hangs? Is this possible?
I'm not sure. Usually if Elastic is unresponsive/not working properly the API server will have issues raising/working and will print out errors
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with much larger swap, so the server only slows down instead of killing anything. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
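For checking the kernel log for OOM kills, a rough sketch: when the Linux OOM killer terminates a process, it logs a line containing `Killed process <pid> (<name>)`. The function below scans saved log text (for example the output of `dmesg`) for such lines. The function name and the sample text are made up for illustration and are not taken from the server in question.

```python
import re

# The kernel OOM killer reports its victim as "Killed process <pid> (<name>)".
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text: str) -> list:
    """Return (pid, process_name) for every OOM kill found in the log text."""
    return [(int(pid), name) for pid, name in OOM_KILL.findall(log_text)]


# Illustrative sample only, not real output from this machine.
sample = (
    "[12345.678] Out of memory: Killed process 4242 (java) total-vm:8000000kB\n"
    "[12346.000] eth0: link up\n"
)
print(find_oom_kills(sample))
```

An empty result on the real kernel log would suggest that either nothing was OOM-killed or the log did not capture it.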
What is interesting, though, is that docker only showed my nginx, minio and docker-registry as exited, while all the clearml containers were still running. I restarted everything and now previously running experiments are shown as aborted. I checked the clearml-agents and I can clearly see that the tasks are still running (high GPU/CPU load and processes still running). But then, after the clearml-agents reconnect to the server, the tasks stop (no more processes running). Super weird.
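To see how the exited containers ended, `docker inspect <container>` prints a JSON array whose `State` object includes `Status`, `ExitCode` and `OOMKilled`. A minimal sketch that summarizes those fields (the helper name and the sample JSON are illustrative only, not output from this setup):

```python
import json


def summarize_exit(inspect_json: str) -> list:
    """Summarize the exit state of each container in `docker inspect` JSON output."""
    summary = []
    for container in json.loads(inspect_json):
        state = container.get("State", {})
        summary.append({
            "name": container.get("Name", "").lstrip("/"),
            "status": state.get("Status"),
            "exit_code": state.get("ExitCode"),
            "oom_killed": state.get("OOMKilled"),
        })
    return summary


# Illustrative sample only; real input would come from `docker inspect <container>`.
sample_inspect = """
[{"Name": "/example-container",
  "State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]
"""
print(summarize_exit(sample_inspect))
```

An exit code of 137 means the main process received SIGKILL (128 + 9), which is what an OOM kill looks like, although other things can send SIGKILL as well.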
ReassuredTiger98 I have a feeling this might be related to Elasticsearch, which usually preallocates a lot of RAM (half of the system's RAM, if I'm not mistaken?) and does not play nice with others... Do you have any specific memory settings for ES in the clearml docker compose?
I see, I just checked the logs and it shows:
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused
[2022-04-29 08:45:55,018] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to.
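`[Errno 111] Connection refused` means nothing was accepting connections on the target address at that moment. A quick way to test that from wherever the API server runs is a plain TCP connect. This is a sketch only: the host and port in the usage line are assumptions (9200 is the default Elasticsearch HTTP port, but whether it is reachable depends on the docker-compose network setup).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address; adjust to wherever Elasticsearch is expected to listen.
    print(can_connect("127.0.0.1", 9200))
```

If this returns False while the container is listed as running, the Elasticsearch process inside it is likely not up, which would match the idea of a process being killed inside a container that itself keeps running.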