Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and docker-compose.yml file that brings all these things together. Probably won't have time for a few weeks.
yeah, for MongoDB, mongodump would be the way to go I guess; for ES you're probably better off simply using ES' built-in snapshot lifecycle management (SLM) policies, which can automate taking snapshots for you ( None )
As opposed to using CRON or something 🤣
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed. The container(s) could have a bash script with a while loop that sleeps for 30 minutes and then takes a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
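A rough sketch of what that loop could look like (not from anyone's actual setup; `MONGO_HOST`, `S3_BUCKET`, and the interval are hypothetical placeholders, and it assumes mongodump and the AWS CLI are installed in the image):

```shell
#!/bin/sh
# Sketch of an "immortal" backup loop; all names below are placeholders.
MONGO_HOST="${MONGO_HOST:-mongo:27017}"
S3_BUCKET="${S3_BUCKET:-s3://my-clearml-backups}"
INTERVAL="${INTERVAL:-1800}"  # 30 minutes

# Build a timestamped archive name, e.g. mongo-backup-20240101T0130.gz
backup_name() {
    echo "mongo-backup-$(date -u +%Y%m%dT%H%M).gz"
}

# Only start looping when invoked with "run", so the function above can be
# sourced or tested without blocking forever.
if [ "${1:-}" = "run" ]; then
    while true; do
        name="$(backup_name)"
        # --archive plus --gzip writes a single compressed dump file
        mongodump --host "$MONGO_HOST" --archive="$name" --gzip
        aws s3 cp "$name" "$S3_BUCKET/$name" && rm -f "$name"
        sleep "$INTERVAL"
    done
fi
```

With a `restart: unless-stopped` policy on the container, compose would bring the loop back if it ever crashed.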
if you want live backups (like a backup every 30 min or 1 h), then you'll need to configure ES snapshots and probably periodically execute mongodump
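For reference, an SLM policy is just a JSON document you PUT to the `_slm` API. A sketch, assuming ES is reachable on localhost:9200 without auth and a snapshot repository named `my_repo` (hypothetical) has already been registered:

```shell
# Sketch: create an SLM policy that snapshots everything every 30 minutes.
# "my_repo" is a hypothetical, already-registered snapshot repository.
curl -X PUT "localhost:9200/_slm/policy/half-hourly-snapshots" \
  -H 'Content-Type: application/json' \
  -d '{
    "schedule": "0 0/30 * * * ?",
    "name": "<clearml-snap-{now/m}>",
    "repository": "my_repo",
    "config": { "include_global_state": true },
    "retention": { "expire_after": "7d", "min_count": 5, "max_count": 50 }
  }'
```

ES then takes and expires snapshots on its own, no cron needed.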
The corresponding restore script would probably look like this
#!/bin/sh
set -e
backup=$1
# requires this script to be called in the directory where the docker-compose file lives
docker-compose down
# preserve the current directory just in case
mv /opt/clearml "/opt/clearml-before-restore-$(date -u +%Y%m%dT%H%M)"
mkdir /opt/clearml
tar -xvzf "$backup" -C /
docker-compose up -d
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
That will probably work if you're happy with the setup being offline for a period of time
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
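Not sure what guarantees a graceful shutdown gives, but ES does expose an explicit flush API you could call first to force in-memory segments to disk. A sketch, assuming ES is exposed on localhost:9200 with no authentication:

```shell
# Sketch: ask ES to commit in-memory segments to disk before stopping.
# Assumes ES is reachable on localhost:9200 without auth.
curl -X POST "localhost:9200/_flush"
docker-compose down
```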
Well, a simple version would be
#!/bin/sh
set -e
# requires this script to be called in the directory where the docker-compose file lives
docker-compose down
tar -cvpzf "clearml-backup-$(date -u +%Y%m%dT%H%M).tar.gz" /opt/clearml
docker-compose up -d
We should put a $100 bounty on a bash script that backs up and restores mongodb, redis, ES, etc. to S3 in the most resilient way 😄
Elasticsearch will potentially be corrupt when you run simple filesystem backups. You have no idea what is committed to disk vs. what is still held in memory. From experience I can tell you that a certain percentage of your backups will be corrupt, and a restore will usually mean partial or even total data loss, since ES may simply refuse to start up, and manually fixing the on-disk state is not practicable. Mongo filesystem snapshots at least used to be an acceptable backup mechanism (still seems to be the case None )
Can vouch, this works well. Had my server hard reboot (maybe bc of clearml? maybe bc of hardware, maybe both… haven’t figured it out), and busy remote workers still managed to update the backend once it came back up.
Re: backups… what would happen if zipped while running but no work was being performed? Still an issue potentially?
and what happens if docker compose down is run while there’s work in the services queue? Will it be restored? What are the implications if a backup is performed at this time and restored later?
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
@<1541954607595393024:profile|BattyCrocodile47> , shouldn't be an issue - ClearML SDK is resilient to connectivity issues so if the server goes down the SDK will continue running and will just store all the data locally, once server is back up, it will send everything that was waiting.
Makes sense?
@<1523701070390366208:profile|CostlyOstrich36> Oh that’s smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
@<1541954607595393024:profile|BattyCrocodile47> , that is indeed the suggested method - although make sure that the server is down while doing this
Also interested in how this is being approached 🙂 What you mentioned is exactly what I am doing