TenseOstrich47 this looks like Elasticsearch is out of space...
TenseOstrich47 The storage in question here is what's available on the machine hosting the ClearML server's docker containers (specifically, the ES one).
After some additional inspection, it seems the issue is docker related:
7.7G /var/lib/docker/overlay2/
This is the directory causing the device storage issues.
From what I can tell, docker has some leakage here. Temp files are not removed correctly, which results in a buildup of disk usage.
See the following for more details:
https://stackoverflow.com/questions/46672001/is-it-safe-to-clean-docker-overlay2
https://forums.docker.com/t/some-way-to-clean-up-identify-contents-of-var-lib-docker-overlay/30604
https://docs.docker.com/storage/storagedriver/overlayfs-driver/
I'm going to write a cleanup script and add it to cron. I don't believe there is an easy way around this issue, as docker trades off disk storage for simplicity.
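A minimal sketch of what such a cleanup script could look like, for anyone following along. It is not the script from this thread. The function names, the 15% threshold, and the choice of prune commands are all assumptions made for illustration. The three docker subcommands used here remove stopped containers, dangling images, and build cache; none of them delete volumes. It defaults to a dry run so nothing is deleted until you opt in.

```python
import shutil
import subprocess

# Prune commands chosen for illustration (an assumption, not from the thread).
# Review them before enabling on a real host.
PRUNE_COMMANDS = [
    ["docker", "container", "prune", "--force"],  # stopped containers
    ["docker", "image", "prune", "--force"],      # dangling images
    ["docker", "builder", "prune", "--force"],    # build cache
]


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def cleanup(min_free=0.15, path="/", dry_run=True, runner=subprocess.run):
    """Prune docker leftovers when free space on `path` drops below `min_free`.

    Returns the commands that were run (or would be run in a dry run).
    `min_free` is a hypothetical threshold; tune it to your disk size.
    """
    if free_fraction(path) >= min_free:
        return []  # enough space, nothing to do
    for cmd in PRUNE_COMMANDS:
        if not dry_run:
            runner(cmd, check=True)  # raises if docker reports an error
    return [list(cmd) for cmd in PRUNE_COMMANDS]


if __name__ == "__main__":
    for cmd in cleanup(dry_run=True):
        print("would run:", " ".join(cmd))
```

To use it from cron, call `cleanup(dry_run=False)` from a scheduled job once the dry-run output looks right for your machine.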
TenseOstrich47 This is typically indicative of insufficient server disk space, which causes ES to go into read-only mode or turn active shards into inactive or unassigned ones (see FAQ).
The disk watermarks controlling the ES free-disk constraints are defined by default as a percentage of the disk space (so it might look to you like you still have plenty of space, but ES thinks otherwise). You can configure different ES settings in the docker-compose.yml file (see here - there are 3 settings, all can be identical).
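To make the percentage point concrete, here is a small sketch of the arithmetic. It assumes the stock Elasticsearch defaults of 85% (low), 90% (high), and 95% (flood stage) used disk; the function names and the 500 GB example are made up for illustration, and a server whose docker-compose.yml overrides these settings will behave differently.

```python
# Stock Elasticsearch watermark defaults, as maximum used-disk fractions.
# A deployment may override these, so treat the values as an assumption.
DEFAULT_WATERMARKS = {"low": 0.85, "high": 0.90, "flood_stage": 0.95}


def required_free_gb(disk_size_gb, watermarks=None):
    """Free space (GB) below which each watermark is crossed."""
    watermarks = DEFAULT_WATERMARKS if watermarks is None else watermarks
    return {
        name: round(disk_size_gb * (1.0 - used), 2)
        for name, used in watermarks.items()
    }


def crossed(disk_size_gb, free_gb, watermarks=None):
    """Names of the watermarks already crossed for the given free space."""
    limits = required_free_gb(disk_size_gb, watermarks)
    return [name for name, limit in limits.items() if free_gb < limit]


if __name__ == "__main__":
    # Hypothetical 500 GB disk with 40 GB free.
    print(required_free_gb(500))
    print(crossed(500, 40))
```

With these defaults, a 500 GB disk needs more than 75 GB free to stay under the low watermark, so 40 GB free already crosses both the low and high watermarks even though it looks like plenty of space.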
If you don't have enough free disk space, clean up files to create more, or resize your partition (or increase your disk size if on a cloud instance).
ES can't use s3 for storage, nor can MongoDB
Thanks Jake, I will have a look. Is there a reason a lot of disk space would be used on the server instance? Is there something in the config I can change to ensure that minimal memory is used on that server, and mostly s3 is used for storage?
TenseOstrich47 see here: https://github.com/allegroai/clearml/issues/316#issuecomment-788995387
that should be the case, we have default_output_uri set to an s3 bucket
I thought nothing should be stored locally on the agent? Shouldn't all files be logged to the storage rather than the instance itself?
RoundCat60 Hey Alex, could you take a look at this when you're free later on, please?
TenseOstrich47 this sounds like a good idea.
When you have a script, please feel free to share, I think it will be useful for other users as well 🙂