we turn off the server every evening...
In that case the issue is definitely not related to the mount points
I think that if these directories are not mounted, you should first of all take care not to shut down the server. You'll probably want to exec /bin/bash into the mongo and elastic containers, and copy their data outside to the host storage
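A rough sketch of that copy step, using docker cp from the host instead of exec-ing a shell inside each container. The container names (clearml-mongo, clearml-elastic), the in-container data paths and the backup destination are all assumptions; check them against your own docker-compose.yml first. The helper only prints the commands so they can be reviewed before anything runs.

```shell
#!/bin/sh
# Sketch only: container names and in-container data paths below are ASSUMPTIONS.
BACKUP_ROOT="${BACKUP_ROOT:-/opt/clearml-backup}"   # hypothetical host destination

# Print (do not run) the docker cp command for one container directory.
backup_cmd() {
    container="$1"; src="$2"; name="$3"
    printf 'docker cp %s:%s %s/%s\n' "$container" "$src" "$BACKUP_ROOT" "$name"
}

# Print the whole plan; review it, then pipe it to `sh` to execute.
backup_plan() {
    printf 'mkdir -p %s\n' "$BACKUP_ROOT"
    backup_cmd clearml-mongo   /data/db                      mongo-db
    backup_cmd clearml-elastic /usr/share/elasticsearch/data elastic-data
}

# Usage:
#   backup_plan            # inspect the commands
#   backup_plan | sudo sh  # run them once the paths look right
```

Copying files out of a live database is not guaranteed to give a consistent snapshot, so treat this as a safety net rather than a proper backup.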
so am I right in thinking it's just the mount points that are missing? based on the output of df above
yep, in most of them:
/opt/clearml/config: apiserver.conf, clearml.conf
/opt/clearml/data/elastic_7: /nodes
/opt/clearml/data/fileserver: <empty>
/opt/clearml/data/mongo/configdb: <empty>
/opt/clearml/data/mongo/db: collection/index files, /diagnostic.data, /journal etc
/opt/clearml/data/redis: dump.rdb
/opt/clearml/logs: apiserver.log.x, fileserver.log (0 bytes)
container_name:
logging:
  options:
    max-size: 10m
After making the change yesterday to the docker-compose file, the server is completely unusable - this is all I see for the /dashboard screen
so yes indeedly ..
sudo find /var/lib/ -type d -exec du -s -x -h {} \; | grep G | more
seems to give saner results.. of course, in your case, you may also want to grep M for megabyte
RoundCat60 you set it once, inside the docker-compose itself.. it will affect all docker containers but, to be honest, docker tends to log everything
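To see whether container logs are actually what is eating the disk, something like the sketch below may help. It assumes Docker's default json-file logging driver, which keeps one *-json.log file per container under /var/lib/docker/containers; the function name is made up for illustration.

```shell
# List container log files by size in KB, largest first.
# Assumes the json-file logging driver and the default Docker data root.
container_log_sizes() {
    root="${1:-/var/lib/docker/containers}"
    find "$root" -type f -name '*-json.log' -exec du -k {} + 2>/dev/null | sort -n -r
}

# Usage (reading /var/lib/docker normally needs root):
#   sudo sh -c '. ./this_file.sh; container_log_sizes'
```

If one file dominates the list, the max-size option should cap it once the containers are recreated.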
back up and running again, thanks for your help
hey RoundCat60 .. did you ever get the problem sorted ?
In the publicly available AMI these are created. However, if you used a previously released Trains AMI and upgraded to ClearML, part of the upgrade process was to create those directories (required by the new docker-compose.yml), as explained here: None
I believe you can set it on a per-container basis as well.
RoundCat60 can you verify all the volume mounts point to existing directories on the server machine? (i.e. /opt/clearml/...)
Oh, that's strange. I'll run one of those soon to see if there's anything wrong with them
Basically whatever was under the old /opt/trains/ folder is required, you can see the list here: None
you will probably want to find the culprit, so a find should work wonders. I'd suspect elasticsearch first. It tends to go nuts 😕
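A slightly tidier way to hunt for the culprit is to rank only the immediate subdirectories by size, then drill down into whichever one is biggest. This is a sketch with a made-up function name, not the command used in the thread.

```shell
# Print the N largest entries one level below a directory, sizes in KB.
# The first row is the directory itself (its total).
largest_dirs() {
    root="$1"
    count="${2:-10}"
    # -x stays on one filesystem, -k gives plain KB so a numeric sort works
    du -x -k -d 1 "$root" 2>/dev/null | sort -n -r | head -n "$count"
}

# Usage:
#   sudo sh -c '. ./this_file.sh; largest_dirs /var/lib/docker 5'
```

Repeating the call on the top entry (for example /var/lib/docker/overlay2) narrows things down quickly.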
it looks like clearml-apiserver and clearml-fileserver are continually restarting
strange, I used one of the publicly available AMIs for ClearML (we did not upgrade from the Trains AMI as we started fresh)
🤔 i'll add the logging max-size now and monitor over the next week
incidentally we turn off the server every evening as it's not used overnight, we've not faced issues with it starting up in the morning or noticed any data loss
Can you perhaps attach your docker-compose.yml file's contents?
btw - if you remove the docker-compose changes, do the containers start normally?
... from the AMI creation script:
# prepare directories to store data
sudo mkdir -p /opt/clearml/data/elastic_7
sudo mkdir -p /opt/clearml/data/redis
sudo mkdir -p /opt/clearml/data/mongo/db
sudo mkdir -p /opt/clearml/data/mongo/configdb
sudo mkdir -p /opt/clearml/logs
sudo mkdir -p /opt/clearml/config
sudo mkdir -p /opt/clearml/data/fileserver
sudo chown -R 1000:1000 /opt/clearml/data/elastic_7
So it seems the AMI is using the correct directories... Do you have these?
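A quick way to answer that is to loop over the same paths as the mkdir lines above and report any that are missing. This is just a sketch; the function name and the optional base-path argument are made up for convenience.

```shell
# Report which of the expected ClearML host directories exist.
# The list mirrors the mkdir lines from the AMI creation script.
check_clearml_dirs() {
    base="${1:-/opt/clearml}"
    missing=0
    for sub in data/elastic_7 data/redis data/mongo/db data/mongo/configdb \
               logs config data/fileserver; do
        if [ -d "$base/$sub" ]; then
            echo "ok       $base/$sub"
        else
            echo "MISSING  $base/$sub"
            missing=$((missing + 1))
        fi
    done
    # Exit status is non-zero when at least one directory is absent.
    [ "$missing" -eq 0 ]
}

# Usage:
#   check_clearml_dirs            # checks /opt/clearml
#   check_clearml_dirs /some/dir  # checks another base path
```

It only checks that the directories exist, not their ownership or whether the containers actually mount them.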
also, is there a list anywhere with the mount points that are needed?
think I found the issue, a typo in apiserver.conf
hhrrmm.. in the initial problem, you mentioned that the /var/lib/docker/overlay2 was growing large in size.. but.. 4GB seems "fine" for docker images.. I wonder .. does your nvme0n1p1 ever report like 85% or 90% used or do you think that the 4GB is a lot ? when you restart the server, does the % used noticeably drop ? that would suggest tmp files inside the docker image itself which.. is possible with docker (weird but, possible)
yeah, that's usually the case when you get an empty dashboard
It looks like not all the containers are up... Try sudo docker ps and see if the apiserver container is restarting...
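If the docker ps output is long, a small filter can pick out just the containers stuck in a restart loop. This is a sketch: the function name is made up, and it assumes the status column starts with "Restarting" for such containers, which is how docker ps normally reports them.

```shell
# Read "name|status" lines on stdin and print the names of containers
# whose status starts with "Restarting".
restarting_containers() {
    awk -F '|' '$2 ~ /^Restarting/ { print $1 }'
}

# Usage (needs a running Docker daemon):
#   sudo docker ps -a --format '{{.Names}}|{{.Status}}' | restarting_containers
#   sudo docker logs --tail 50 clearml-apiserver   # then look at why it restarts
```

The last lines of the container's log usually point at the cause, for example a config file that fails to parse.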