Can vouch, this works well. Had my server hard reboot (maybe bc of clearml? maybe bc of hardware, maybe both… haven’t figured it out), and busy remote workers still managed to update the backend once it came back up.
Re: backups… what would happen if zipped while running but no work was being performed? Still an issue potentially?
and what happens if docker compose down is run while there’s work in the services queue? Will it be restored? What are the implications if a backup is performed at this time and restored later?
@<1541954607595393024:profile|BattyCrocodile47> , shouldn't be an issue - ClearML SDK is resilient to connectivity issues so if the server goes down the SDK will continue running and will just store all the data locally, once server is back up, it will send everything that was waiting.
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have one backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
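For example (just a sketch — the script name, path, and schedule are made up, not anything from the actual setup), a cron entry could take backups every few hours instead of nightly:

```
# hypothetical: run a backup script every 4 hours instead of once a day
0 */4 * * * /opt/scripts/clearml-backup.sh >> /var/log/clearml-backup.log 2>&1
```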
The corresponding restore script would probably look like this
#!/bin/sh
backup=$1
# requires this script to be called in the directory where the docker-compose file lives
docker-compose down
# preserve the current directory just in case
mv /opt/clearml /opt/clearml-before-restore-$(date -u +%Y%m%dT%H%M)
mkdir /opt/clearml
tar -xvzf "$backup" -C /
docker-compose up
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed.
The container(s) could have a bash script with a while loop in it that sleeps for 30 minutes and then does a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
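A rough sketch of that sidecar idea (the service name, image, host paths, and Mongo hostname are all assumptions — adapt them to the actual ClearML compose file):

```yaml
services:
  backup:
    image: mongo:4.4            # ships with mongodump; pick the tag matching your Mongo version
    restart: unless-stopped     # the "immortal" part: compose restarts the container if it dies
    volumes:
      - /opt/clearml/backups:/backups
    entrypoint: /bin/sh
    command:
      - -c
      - |
        # sleep-loop backup, as described above ($$ escapes $ for compose interpolation)
        while true; do
          stamp=$$(date -u +%Y%m%dT%H%M)
          mongodump --host mongo --archive=/backups/mongo-$$stamp.gz --gzip
          # aws s3 cp /backups/mongo-$$stamp.gz s3://my-bucket/clearml-backups/  # if the AWS CLI is baked into the image
          sleep 1800
        done
```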
Elasticsearch will potentially be corrupt when you run simple filesystem backups. You have no idea what is committed to disk vs what is still contained in memory. From experience I can tell you that a certain percentage of your backups will be corrupt, and a restore will usually mean partial data loss, or even total loss, since ES may simply refuse to start up and manually fixing the on-disk state is not practicable. Mongo filesystem snapshots at least used to be an acceptable backup mechanism (still seems to be the case)
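For ES specifically, the supported alternative to filesystem copies is its snapshot API. A rough sketch (the repository name and location are assumptions, and the location directory has to be whitelisted under path.repo in elasticsearch.yml first):

```shell
# one-time: register a filesystem snapshot repository (location must be listed in path.repo)
curl -X PUT "localhost:9200/_snapshot/clearml_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/es-backups"}}'

# per backup: take a named snapshot and block until it finishes
curl -X PUT "localhost:9200/_snapshot/clearml_backup/snap-$(date -u +%Y%m%dT%H%M)?wait_for_completion=true"
```

Snapshots taken this way are consistent even while ES is serving writes, which sidesteps the disk-vs-memory problem entirely.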
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.