we turn off the server every evening...
In that case the issue is definitely not related to the mount points
@RoundCat60 can you verify that all the volume mounts point to existing directories on the server machine? (e.g. /opt/clearml/...)
so yes indeedly ..
sudo find /var/lib/ -type d -exec du -s -x -h {} \; | grep G | more
seems to give saner results.. of course, in your case, you may also want to grep M for megabyte
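A rough sketch of the same idea with a single `du` pass, so sizes are filtered numerically instead of by grepping for `G` or `M`. The function name `big_dirs` and the 100 MiB default threshold are made up for illustration and are not from this thread.

```shell
# Hypothetical helper: list directories under a root that use at least
# MIN_MB mebibytes, largest first. -x keeps du on one filesystem,
# -m reports sizes in MiB so awk can compare them as numbers.
big_dirs() {
  root="$1"
  min_mb="${2:-100}"
  du -x -m "$root" 2>/dev/null |
    awk -v min="$min_mb" '$1 + 0 >= min + 0' |
    sort -rn
}

# Example (from a root shell, so everything under /var/lib is readable):
#   big_dirs /var/lib 500
```

Because `du` already walks the tree, there is no need to run it once per directory through `find -exec`.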
I think that if these directories are not mounted, you should first of all take care not to shut down the server. You'll probably want to exec /bin/bash into the mongo and elastic containers, and copy their data outside to the host storage
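A minimal sketch of that copy step, using `docker cp` from the host rather than an interactive shell inside the container. The container names, the in-container data paths, and the backup directory in the example calls are assumptions, not taken from this thread; check them against your own docker-compose.yml before running anything.

```shell
# Hypothetical helper: copy a container's data directory out to the host.
# Usage: backup_container_data <container> <path-in-container> <host-dir>
backup_container_data() {
  container="$1"
  src="$2"
  dest="$3"
  mkdir -p "$dest" || return 1
  # docker cp works on both running and stopped containers
  docker cp "$container:$src" "$dest/"
}

# Example calls (names and paths are ASSUMED; verify them first):
#   backup_container_data clearml-mongo   /data/db                      /opt/clearml-backup/mongo
#   backup_container_data clearml-elastic /usr/share/elasticsearch/data /opt/clearml-backup/elastic
```

Files copied from a live database are not guaranteed to be consistent, so this is best treated as a rescue copy rather than a proper backup.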
Basically whatever was under the old /opt/trains/ folder is required, you can see the list here: None
not yet, going to try and fix it today.
if I do a df I see this, which is concerning:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G  928K  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   20G  7.9G   13G  40% /
tmpfs           790M     0  790M   0% /run/user/1000
so it looks like the mount points are not created. When do these get created? I thought that with an AMI these would already have been set up?
thanks @AlertBlackbird30 this is really informative. Nothing seems to be particularly out of the ordinary though
3.7G /var/lib/
3.7G /var/lib/docker
3.0G /var/lib/docker/overlay2
followed by a whole load of files that are a few hundred KBs in size, nothing huge though
🤔 i'll add the logging max_size now and monitor over the next week
also, is there a list anywhere with the mount points that are needed?
Not necessarily, is there any data in those directories?
Howdy and morning @RoundCat60 .. btw, when docker uses overlay2 its mount points don't show up in a plain 'df', only in a 'df -a'. Since they are simply 'overlays', they don't (technically) consume any space of their own: the files still live under /var/lib, but df doesn't count them against the overlay mounts
this is why I was suggesting a find, maybe with a 'du' .. actually.. let me try that here.. 2s
strange, I used one of the publicly available AMIs for ClearML (we did not upgrade from the Trains AMI, as we started fresh)
yeah, that's usually the case when you get an empty dashboard
no, they are still rebooting. i've looked in /opt/clearml/logs/apiserver.log no errors
back up and running again, thanks for your help
container_name:
logging:
  options:
    max-size: 10m
it looks like clearml-apiserver and clearml-fileserver are continually restarting
so am I right in thinking it's just the mount points that are missing, based on the output of df above?
In the publicly available AMI these are created. However, if you used a previously released Trains AMI and upgraded to ClearML, part of the upgrade process was to create those directories (required by the new docker-compose.yml ), as explained here: None
think I found the issue, a typo in apiserver.conf
btw - if you remove the docker-compose changes, do the containers start normally?
you will probably want to find the culprit, so a find should work wonders. I'd suspect elasticsearch first. It tends to go nuts 😕
Check sudo docker logs <container-name>
hhrrmm.. in the initial problem, you mentioned that the /var/lib/docker/overlay2 was growing large in size.. but.. 4GB seems "fine" for docker images.. I wonder .. does your nvme0n1p1 ever report like 85% or 90% used or do you think that the 4GB is a lot ? when you restart the server, does the % used noticeably drop ? that would suggest tmp files inside the docker image itself which.. is possible with docker (weird but, possible)
After making the change yesterday to the docker-compose file, the server is completely unusable - this is all I see for the /dashboard screen
incidentally we turn off the server every evening as it's not used overnight, we've not faced issues with it starting up in the morning or noticed any data loss
Can you perhaps attach your docker-compose.yml file's contents?
It looks like not all the containers are up... Try sudo docker ps and see if the apiserver container is restarting...
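A small sketch that combines the two checks, `docker ps` and `docker logs`: it lists the containers Docker reports as restarting and prints the tail of each one's log. The function name `show_restarting_logs` and the default of 50 lines are made up for illustration.

```shell
# Hypothetical helper: show the last log lines of every container that
# Docker currently reports as restarting.
show_restarting_logs() {
  lines="${1:-50}"
  docker ps --filter "status=restarting" --format '{{.Names}}' |
    while read -r name; do
      echo "=== $name ==="
      # </dev/null keeps docker from consuming the list of names on stdin
      docker logs --tail "$lines" "$name" </dev/null 2>&1
    done
}

# Example (from a root shell, or as a user in the docker group):
#   show_restarting_logs 100
```

A container caught between restarts can briefly show a different status, so an empty result is worth re-running once or twice before trusting it.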