not entirely sure on this as we used the custom AMI solution, is there any documentation on it?
also, is there a list anywhere with the mount points that are needed?
container_name:
  logging:
    options:
      max-size: 10m
hey @<1687643893996195840:profile|RoundCat60> .. did you ever get the problem sorted?
Oh, that's strange. I'll run one of those soon to see if there's anything wrong with them
Hi @<1687643893996195840:profile|RoundCat60> ,
We've actually never had to address this issue. Can you find out what exactly is growing in size? I'd like to make sure this is not due to the containers storing data internally (causing docker to store more and more snapshots) - this is an unhealthy situation that might also indicate that volumes are not mounted correctly (i.e. data that should be stored externally is actually stored internally)
not yet, going to try and fix it today.
if I do a df, I see this, which is concerning:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.9G     0  3.9G   0% /dev
tmpfs           3.9G     0  3.9G   0% /dev/shm
tmpfs           3.9G  928K  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/nvme0n1p1   20G  7.9G   13G  40% /
tmpfs           790M     0  790M   0% /run/user/1000
so it looks like the mount points are not created. When do these get created? I thought using an AMI these would have already been set up?
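A quick way to see how much of that used space is docker itself:
# breaks usage down into images, containers and local volumes; add -v for a per-item breakdown
sudo docker system df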
It looks like not all the containers are up... Try sudo docker ps and see if the apiserver container is restarting...
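If you just want names and status at a glance, something like this works too:
# a crash-looping container shows up as "Restarting (<exit code>) N seconds ago"
sudo docker ps -a --format 'table {{.Names}}\t{{.Status}}'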
btw - if you remove the docker-compose changes, do the containers start normally?
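For reference, a clean restart of the stack would be something like this (assuming the compose file is at /opt/clearml/docker-compose.yml, as on the AMI):
# stop and remove the containers (data lives on the host mounts, so it survives)
sudo docker-compose -f /opt/clearml/docker-compose.yml down
# recreate them in the background
sudo docker-compose -f /opt/clearml/docker-compose.yml up -d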
Not necessarily, is there any data in those directories?
it looks like clearml-apiserver and clearml-fileserver are continually restarting
Basically whatever was under the old /opt/trains/ folder is required; you can see the list here: None
In the publicly available AMI these are created. However, if you used a previously released Trains AMI and upgraded to ClearML, part of the upgrade process was to create those directories (required by the new docker-compose.yml), as explained here: None
yep, in most of them:
/opt/clearml/config: apiserver.conf, clearml.conf
/opt/clearml/data/elastic_7: /nodes
/opt/clearml/data/fileserver: <empty>
/opt/clearml/data/mongo/configdb: <empty>
/opt/clearml/data/mongo/db: collection/index files, /diagnostic.data, /journal, etc.
/opt/clearml/data/redis: dump.rdb
/opt/clearml/logs: apiserver.log.x, fileserver.log (0 bytes)
I added this to each of the containers:
logging:
  options:
    max-file: 5
    max-size: 10m
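If you want to confirm the rotation is actually taking effect (note the containers need to be recreated, not just restarted, for new logging options to apply), the json logs live one directory per container under /var/lib/docker/containers/:
# the active log is <container-id>-json.log, rotated ones get .1, .2 suffixes
sudo du -sh /var/lib/docker/containers/*/*-json.log* | sort -h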
back up and running again, thanks for your help
Good point @<1523715084633772032:profile|AlertBlackbird30> 👍
think I found the issue, a typo in apiserver.conf
@<1687643893996195840:profile|RoundCat60> you set it once, inside the docker-compose itself.. it will affect all docker containers but, to be honest, docker tends to log everything
so yes indeedly ..
sudo find /var/lib/ -type d -exec du -s -x -h {} \; | grep G | more
seems to give saner results.. of course, in your case, you may also want to grep M for megabyte
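Another variant that sorts by size, so the biggest directories end up at the bottom:
# -x stays on one filesystem, sort -h understands the K/M/G suffixes
sudo du -x -h --max-depth=3 /var/lib/docker | sort -h | tail -20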
yeah, that's usually the case when you get an empty dashboard
I think that if these directories are not mounted, you should first of all take care not to shut down the server. You'll probably want to exec /bin/bash into the mongo and elastic containers, and copy their data outside to the host storage
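Something along these lines should get the data out while the containers are still running. This is a sketch: the container names (clearml-mongo, clearml-elastic) and internal paths are taken from the default docker-compose.yml, so double-check yours:
sudo mkdir -p /opt/clearml/backup
# mongo keeps its data under /data/db inside the container
sudo docker cp clearml-mongo:/data/db /opt/clearml/backup/mongo
# elastic keeps its data under /usr/share/elasticsearch/data
sudo docker cp clearml-elastic:/usr/share/elasticsearch/data /opt/clearml/backup/elastic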
hhrrmm.. in the initial problem, you mentioned that the /var/lib/docker/overlay2 was growing large in size.. but.. 4GB seems "fine" for docker images.. I wonder .. does your nvme0n1p1 ever report like 85% or 90% used or do you think that the 4GB is a lot ? when you restart the server, does the % used noticeably drop ? that would suggest tmp files inside the docker image itself which.. is possible with docker (weird but, possible)
Check sudo docker logs <container-name>
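e.g. for the two that keep restarting:
# --tail keeps the output manageable; add -f to follow live
sudo docker logs --tail 100 clearml-apiserver
sudo docker logs --tail 100 clearml-fileserver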
you will probably want to find the culprit, so a find should work wonders. I'd suspect elasticsearch first. It tends to go nuts 😕
... from the AMI creation script:
# prepare directories to store data
sudo mkdir -p /opt/clearml/data/elastic_7
sudo mkdir -p /opt/clearml/data/redis
sudo mkdir -p /opt/clearml/data/mongo/db
sudo mkdir -p /opt/clearml/data/mongo/configdb
sudo mkdir -p /opt/clearml/logs
sudo mkdir -p /opt/clearml/config
sudo mkdir -p /opt/clearml/data/fileserver
sudo chown -R 1000:1000 /opt/clearml/data/elastic_7
So it seems the AMI is using the correct directories... Do you have these?
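A quick way to check them all in one go (ls will error on any that are missing):
for d in config logs data/elastic_7 data/redis data/mongo/db data/mongo/configdb data/fileserver; do sudo ls -ld /opt/clearml/$d; done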
Hey there *waves*
Not sure about plans to automate this in the future, as this is more how docker behaves and not really clearml, especially with the overlay2 filesystem. The biggest offender usually is your json logfiles. Have a look in /var/lib/docker/containers/ for *.log
assuming this IS the case, you can tell docker to only log up to a max-size .. I have mine set to 100m or some such
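if you'd rather set it once for the whole daemon instead of per service in the compose file, the default log driver options live in /etc/docker/daemon.json (only applies to containers created after a daemon restart):
# careful: this overwrites any existing /etc/docker/daemon.json
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}
EOF
sudo systemctl restart docker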
no, they are still restarting. I've looked in /opt/clearml/logs/apiserver.log
no errors