SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like an OOM to me, but I will test this again with much more swap, so that the server only slows down instead of killing something. Unfortunately, the kernel logs do not show much either (maybe my server logging is misconfigured, I am no expert).
What is interesting, though, is that docker only showed my nginx, minio and docker-registry containers as exited, while all the clearml containers were still running. I restarted everything, and now the previously running experiments are shown as aborted. I checked the clearml-agents and I can clearly see that the tasks are still running (high GPU/CPU load, and the processes are still there). But then, after the clearml-agents reconnect to the server, the tasks stop (no more processes running). Super weird.
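To double-check the OOM theory next time, something like this rough Python sketch could scan the kernel log for OOM-killer entries. The function name `find_oom_kills` is made up for illustration, and the pattern assumes the usual kernel wording ("Out of memory: Killed process <pid> (<name>)" or the older "Kill process"), which may differ between kernel versions.

```python
import re

# Assumption: the kernel logs OOM kills as "... Killed process <pid> (<name>) ..."
# (newer kernels) or "... Kill process <pid> (<name>) ..." (older kernels).
OOM_RE = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for OOM-killer lines in kernel log text."""
    kills = []
    for line in log_text.splitlines():
        lowered = line.lower()
        # Only consider lines that mention the OOM killer at all.
        if "out of memory" not in lowered and "oom" not in lowered:
            continue
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

The idea would be to feed it the text from `dmesg` or `journalctl -k`. If it comes back empty even though containers died, that would point either away from OOM or at kernel logging not being kept.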