Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and docker-compose.yml file that brings all these things together. Probably won't have time for a few weeks.
yeah, for MongoDB, mongodump would be the way to go I guess; for ES you're probably better off simply using ES' built-in snapshot lifecycle management (SLM) policies, which can automate taking snapshots for you ( None )
As opposed to using CRON or something 🤣
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed. The container(s) could have a bash script with a while loop that sleeps for 30 minutes and then takes a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
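A rough sketch of what that loop could look like (not from anyone's actual setup; `MONGO_HOST`, `S3_BUCKET`, and the interval are hypothetical placeholders, and it assumes mongodump and the AWS CLI are installed in the image):

```shell
#!/bin/sh
# Sketch of an "immortal" backup loop; all names below are placeholders.
MONGO_HOST="${MONGO_HOST:-mongo:27017}"
S3_BUCKET="${S3_BUCKET:-s3://my-clearml-backups}"
INTERVAL="${INTERVAL:-1800}"  # 30 minutes

# Build a timestamped archive name, e.g. mongo-backup-20240101T0130.gz
backup_name() {
    echo "mongo-backup-$(date -u +%Y%m%dT%H%M).gz"
}

# Only start looping when invoked with "run", so the function above can be
# sourced or tested without blocking forever.
if [ "${1:-}" = "run" ]; then
    while true; do
        name="$(backup_name)"
        # --archive plus --gzip writes a single compressed dump file
        mongodump --host "$MONGO_HOST" --archive="$name" --gzip
        aws s3 cp "$name" "$S3_BUCKET/$name" && rm -f "$name"
        sleep "$INTERVAL"
    done
fi
```

With a `restart: unless-stopped` policy on the container, compose would bring the loop back if it ever crashed.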
if you want live backups (like a backup every 30 min or 1 h), then you'll need to configure ES snapshots and probably periodically execute mongodump
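For reference, an SLM policy is just a JSON document you PUT to the `_slm` API. A sketch, assuming ES is reachable on localhost:9200 without auth and a snapshot repository named `my_repo` (hypothetical) has already been registered:

```shell
# Sketch: create an SLM policy that snapshots everything every 30 minutes.
# "my_repo" is a hypothetical, already-registered snapshot repository.
curl -X PUT "localhost:9200/_slm/policy/half-hourly-snapshots" \
  -H 'Content-Type: application/json' \
  -d '{
    "schedule": "0 0/30 * * * ?",
    "name": "<clearml-snap-{now/m}>",
    "repository": "my_repo",
    "config": { "include_global_state": true },
    "retention": { "expire_after": "7d", "min_count": 5, "max_count": 50 }
  }'
```

ES then takes and expires snapshots on its own, no cron needed.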
The corresponding restore script would probably look like this
#!/bin/sh
set -e
backup=$1
# requires this script to be called in the directory where the docker-compose file lives
docker-compose down
# preserve the current directory just in case
mv /opt/clearml "/opt/clearml-before-restore-$(date -u +%Y%m%dT%H%M)"
mkdir /opt/clearml
tar -xvzf "$backup" -C /
docker-compose up -d
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
That will probably work if you're happy with the setup being offline for a period of time
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
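Not sure what guarantees a graceful shutdown gives, but ES does expose an explicit flush API you could call first to force in-memory segments to disk. A sketch, assuming ES is exposed on localhost:9200 with no authentication:

```shell
# Sketch: ask ES to commit in-memory segments to disk before stopping.
# Assumes ES is reachable on localhost:9200 without auth.
curl -X POST "localhost:9200/_flush"
docker-compose down
```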
Well, a simple version would be
#!/bin/sh
set -e
# requires this script to be called in the directory where the docker-compose file lives
docker-compose down
tar -cvpzf "clearml-backup-$(date -u +%Y%m%dT%H%M).tar.gz" /opt/clearml
docker-compose up -d
We should put a $100 bounty on a bash script that backs up and restores mongodb, redis, ES, etc. to S3 in the most resilient way 😄
Elasticsearch will potentially be corrupt when you run simple filesystem backups. You have no idea what is committed to disk vs. what is still held in memory. From experience I can tell you that a certain percentage of your backups will be corrupt, and a restore will usually mean partial or even total data loss, since ES may simply refuse to start up, and manually fixing the on-disk state is not practicable. Mongo filesystem snapshots at least used to be an acceptable backup mechanism (still seems to be the case None )
Can vouch, this works well. Had my server hard reboot (maybe bc of clearml? maybe bc of hardware, maybe both… haven’t figured it out), and busy remote workers still managed to update the backend once it came back up.
Re: backups… what would happen if zipped while running but no work was being performed? Still an issue potentially?
and what happens if docker compose down is run while there’s work in the services queue? Will it be restored? What are the implications if a backup is performed at this time and restored later?
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
@<1541954607595393024:profile|BattyCrocodile47> , shouldn't be an issue - ClearML SDK is resilient to connectivity issues so if the server goes down the SDK will continue running and will just store all the data locally, once server is back up, it will send everything that was waiting.
Makes sense?
@<1523701070390366208:profile|CostlyOstrich36> Oh that’s smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
@<1541954607595393024:profile|BattyCrocodile47> , that is indeed the suggested method - although make sure that the server is down while doing this
Also interested in how this is being approached 🙂 What you mentioned is exactly what I am doing