SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so that the server only slows down instead of killing something. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
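In case it helps with the kernel logs: a minimal sketch for filtering them for the messages that usually accompany an OOM kill. The helper name oom_lines is made up for illustration, and the pattern is an assumption about typical kernel wording, so it may miss messages on some kernel versions.

```shell
# Hypothetical helper: keep only log lines that usually accompany an OOM kill.
# The pattern is a guess at common kernel wording, not an exhaustive list.
oom_lines() {
  grep -i -E 'out of memory|oom-kill|killed process'
}

# Typical use (usually needs root):
#   sudo dmesg -T | oom_lines
#   sudo journalctl -k | oom_lines
```

If neither command prints anything after a crash, that does not strictly rule out OOM, since the kernel ring buffer is lost on reboot unless the journal is persistent.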
What is interesting though is that docker only showed my nginx, minio and docker-registry to have exited, while all the clearml containers were still running. I restarted everything and now previously running experiments are shown as aborted. I checked the clearml-agents and I can clearly see that the tasks are still running (high GPU/CPU load and processes still running). But then, after the clearml-agents reconnect to the server, the tasks stop (no more processes running). Super weird.
Could be a clean log after restart. Unfortunately, I restarted the server right away 😞 I'm gonna post if it happens again, with the appropriate logs.
This is very strange, as we have lots of experience with the server going down while experiments wait and keep running - are you sure the agents keep going as usual?
I'm not sure. Usually, if Elastic is unresponsive/not working properly, the API server will have issues coming up/working and will print out errors
I usually also experience no problems with restarting the clearml-server. It seems like it has to do with the OOM (or whatever issue I have).
What happens if you look at elastic container logs directly? I think it's something along the lines of: sudo docker logs clearml-elastic --follow. Don't catch me on the exact syntax naming tho 😛
CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container will hang? Is this possible?
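One way to check this from the docker side, as a rough sketch: docker records an OOMKilled flag and an exit code per container. The function names below are made up for illustration. As far as I know the flag reflects cgroup-level OOM events, so a host-wide OOM kill may not show up there, and exit code 137 (128 + SIGKILL) only says the process was killed, not why.

```shell
# Hypothetical helper: one row per container (running or exited) with
# name, status, OOMKilled flag and exit code.
container_states() {
  sudo docker inspect \
    --format '{{.Name}} {{.State.Status}} {{.State.OOMKilled}} {{.State.ExitCode}}' \
    $(sudo docker ps -aq)
}

# Keep only rows that hint at a kill: OOMKilled=true or exit code 137.
suspicious() {
  awk '$3 == "true" || $4 == 137'
}

# Typical use:
#   container_states | suspicious
```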
SuccessfulKoala55 I did not observe elastic to use much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage? ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
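For context, -Xms2g -Xmx2g fixes the JVM heap at 2 GB; memory used outside the heap is not bounded by those flags. To see what each container actually uses over time (rather than only right after starting), a small sketch like the following could be left running until the next crash. The function name and the log path in the comment are made up for illustration.

```shell
# Hypothetical helper: print a timestamp followed by one memory snapshot
# of all running containers.
mem_snapshot() {
  date '+%Y-%m-%d %H:%M:%S'
  sudo docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
}

# Typical use: append a snapshot every minute to a log file (path is an example):
#   while true; do mem_snapshot >> "$HOME/docker-mem.log"; sleep 60; done
```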
Yes, I am also talking about agents on different machines. I had two agents on the server machine, which also seem to have been killed. The ones on different machines kept working until 1 or 2 minutes after the clearml-server restarted.
Shows some logs, but nothing of relevance I think. Only infos and warnings about deprecated stuff that is still used ;D ...
ReassuredTiger98 I have a feeling this might be related to Elasticsearch, which usually preallocates a lot of RAM (half of the system's RAM, if I'm not mistaken?) and does not play nice with others... Do you have any specific memory settings for ES in the clearml docker compose?
I see, I just checked the logs and it shows: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused [2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to
128GB RAM, 32 cores and 2 GPUs.
WOW 😮 I'm so jealous
However, after a while my container will exit, but also the clearml-server stops responding correctly. WebUI will not show updates and only a few experiments are shown at all. After restarting the apiserver, the clearml-server works correctly again.
Do you get any errors on how/why the container exits? Which container is it?
Are you sure this is not a clean log following a restart? If you do sudo docker ps, do all containers have roughly the same up time?
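To compare up times more precisely than the "Up X minutes" column, here is a sketch that lists each running container with the time it was last started, oldest first. The function name is made up for illustration. A container that restarted on its own would show a noticeably later timestamp than the rest.

```shell
# Hypothetical helper: start time and name of every running container,
# sorted so the longest-running one comes first.
container_start_times() {
  sudo docker inspect --format '{{.State.StartedAt}} {{.Name}}' \
    $(sudo docker ps -q) | sort
}

# Typical use:
#   container_start_times
```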
When using agents on different machines this never happens (and we do server restarts and downtimes regularly with multiple experiments running almost daily)