Whilst shutting down some old projects, I’m archiving all of their folders and storing them in S3 in case of an emergency.

I don’t have AWS credentials stored and set up on the EC2 instance, so I’m passing them through as variables to the script below.

As with the script to export tens of thousands of individual MySQL databases, you’ll want to edit the values at the top:

  • EFS_MOUNT_POINT is the full path to where the EFS drive is mounted
  • S3_BUCKET is your target S3 bucket
  • AWS_… self-explanatory: your access key, secret key, and region
  • ARCHIVE_DIR is the full path to temporarily store the archives
  • LOG_FILE is the path to the log file for this sync action
  • DEPTH sets how many directory levels deep to go before creating an archive (see below)

The $DEPTH value lets you control when a new archive will be created. In my situation, I have a directory structure like:

  • /efs/vhosts/somesite.com/
  • /efs/vhosts/differentsite.com/
  • /efs/vhosts/finalsite.com/
  • /efs/logs/somesite.com/
  • /efs/logs/differentsite.com/
  • /efs/logs/finalsite.com/
  • … and so on

I don’t want a single giant archive containing everything, so I’ll set my $DEPTH to 2 so that I end up with an archive for each vhost, each log directory, etc.
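
With that layout and DEPTH=2, the find call the script uses will only match those second-level directories, so each one becomes its own archive. If you want to sanity-check the depth before running anything, you can run the same find yourself (this assumes your EFS mount is at /efs, as in the example above):

cd /efs
find . -mindepth 2 -maxdepth 2 -type d
# Expected output, based on the structure above:
# ./vhosts/somesite.com
# ./vhosts/differentsite.com
# ./vhosts/finalsite.com
# ./logs/somesite.com
# ...

Whatever that prints is exactly the list of folders that will be archived and uploaded.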

Let’s look at the script now:


#!/bin/bash

# Variables
EFS_MOUNT_POINT="/path/to/efs/mount"
S3_BUCKET="s3://your-target-s3-bucket"
AWS_ACCESS_KEY_ID="your-access-key-id"
AWS_SECRET_ACCESS_KEY="your-secret-access-key"
AWS_REGION="us-east-1"
ARCHIVE_DIR="/tmp/efs_archives"
LOG_FILE="/tmp/sync_log.txt"
DEPTH=2

# Ensure the archive directory exists
mkdir -p "$ARCHIVE_DIR"

# Clear the log file before we start
> "$LOG_FILE"

# Read and iterate the folders
cd "$EFS_MOUNT_POINT" || exit 1

total_dirs=$(find . -mindepth $DEPTH -maxdepth $DEPTH -type d | wc -l)
current_dir=0
start_time=$(date +%s)

find . -mindepth $DEPTH -maxdepth $DEPTH -type d | while read -r dir; do
  ((current_dir++))
  archive_path="${ARCHIVE_DIR}/$(echo "${dir#./}" | sed 's|/|-|g').tar.gz"
  archive_name=$(basename "$archive_path")

  echo "[$(date +'%Y-%m-%d %H:%M:%S')] Starting sync for: $archive_name ($current_dir of $total_dirs)"
  
  # Create the GZIP archive
  tar -czf "$archive_path" "$dir"
  if [ $? -ne 0 ]; then
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$current_dir/$total_dirs] Failed to create archive for $dir" >> "$LOG_FILE"
    continue
  fi

  # Sync the archive to S3
  AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY aws s3 cp "$archive_path" "$S3_BUCKET" --region "$AWS_REGION"
  if [ $? -ne 0 ]; then
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$current_dir/$total_dirs] Failed to sync $archive_name to S3" >> "$LOG_FILE"
    continue
  fi

  # Remove the local archive
  rm "$archive_path"
  if [ $? -ne 0 ]; then
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$current_dir/$total_dirs] Failed to delete local archive $archive_name" >> "$LOG_FILE"
    continue
  fi

  # Estimate remaining time
  current_time=$(date +%s)
  elapsed_time=$((current_time - start_time))
  avg_time_per_dir=$((elapsed_time / current_dir))
  remaining_time=$((avg_time_per_dir * (total_dirs - current_dir)))
  echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$current_dir/$total_dirs] Successfully processed $dir. Estimated time remaining: $((remaining_time / 60)) minutes and $((remaining_time % 60)) seconds"
done

# End time
end_time=$(date +%s)
total_duration=$((end_time - start_time))
echo "[$(date +'%Y-%m-%d %H:%M:%S')] All backups completed in $(date -u -d @$total_duration +'%H:%M:%S')"

Once you’re all set, make the script executable (chmod +x sync.sh) and start a screen session (screen -S sync). When you’re ready to go, run the script we just created (./sync.sh). Once it’s running and outputting successful progress, you can detach the screen (Ctrl+a then d) and it will carry on running in the background.
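
If it helps, the whole start-up sequence looks something like this (assuming you saved the script as sync.sh in your current directory):

chmod +x sync.sh
screen -S sync
./sync.sh
# once a few "Successfully processed ..." lines appear, press Ctrl+a then d to detach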

As it processes each archive, the estimated time remaining will adjust itself - it calculates the average time per archive so far and applies that to the number of folders it still has to handle.
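
As a rough worked example (with made-up numbers), this is the arithmetic the script performs on each pass:

elapsed_time=600      # 40 archives processed in 600 seconds so far
current_dir=40
total_dirs=200
avg_time_per_dir=$((elapsed_time / current_dir))                    # 600 / 40 = 15 seconds per archive
remaining_time=$((avg_time_per_dir * (total_dirs - current_dir)))   # 15 * 160 = 2400 seconds, roughly 40 minutes

Because it’s a simple average over everything processed so far, the estimate settles down as more archives complete - although one unusually large vhost can still skew it.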

Whenever you want to check in, you can tail the log file (tail -f /tmp/sync_log.txt) or you can rejoin the screen with screen -r sync to see the estimated remaining time. Don’t forget that if you rejoin the screen, you need to detach it again with Ctrl+a then d rather than exiting the session, or the script will stop and you’ll have to start again!
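
One last note: the script only writes to the log file when something fails, so an empty log is good news. If an upload does fail, the entry will look something like this (timestamp and counter made up for illustration):

[2025-01-01 12:00:00] [12/200] Failed to sync vhosts-somesite.com.tar.gz to S3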