Restoring and downloading S3 Glacier objects using s3cmd

I currently have a portion of my backups on S3, with a life-cycle policy that moves the objects to Glacier after a period of time. This makes the storage much cheaper ($0.01/GB/month, down from $0.03/GB/month – Source), but has the downside that objects require a roughly 4-hour restore period before they become available for download. I have needed some objects quickly, and for those the 4-hour restore time isn’t worth the savings. Unfortunately, once an object has had this life-cycle applied to it, it can only be temporarily restored. To make it a standard object again, you have to download it, delete the Glacier object, and then re-upload it. Doing all of that wasn’t quite as straightforward as I thought it might be, but (I think) I figured out a way to get it done rather painlessly.
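
For a single object, that round trip looks roughly like this (a sketch, assuming s3cmd 1.5 or newer; the bucket name matches the script below and the object path is just a placeholder):

# Ask Glacier for a temporary copy, kept available for 7 days (the restore takes about 4 hours)
s3cmd restore --restore-days=7 s3://bucketname/path/to/object
# ...once the restore has completed, download it, delete the Glacier copy, and re-upload it
s3cmd get s3://bucketname/path/to/object ./object
s3cmd del s3://bucketname/path/to/object
s3cmd put ./object s3://bucketname/path/to/object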

I’m going to be using s3cmd and a few cron jobs to automate this.

First, get s3cmd version 1.5. This version supports initiating restores on Glacier objects. You can recursively initiate a restore on every object in a bucket, but as soon as it hits a non-Glacier object it stops. You can also use s3cmd to download all of the objects in the bucket, but as soon as it hits a Glacier object the download stops and you end up with a zero-byte file. (Hey s3cmd developers, would you mind fixing this behavior, or at least adding a way to force progression past a failure, so we can walk through the entire bucket in one go?)
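
Those two passes look something like this on the command line (a sketch; the bucket and destination folder match the ones used in the script below):

# Initiate restores across the whole bucket; this stops at the first non-Glacier object
s3cmd restore --recursive s3://bucketname/
# Download everything; this stops (leaving a zero-byte file) at the first still-frozen Glacier object
s3cmd sync --recursive s3://bucketname/ /destination_folder/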

The solution had to involve initiating restores, waiting at least 4 hours for them to complete, then going back to download the restored data and delete it from the bucket, then deleting any zero-byte files, and then doing it all over again later.

Ain’t nobody got time for that. Except cron. Cron has plenty of time for that.

First of all, make sure you have s3cmd installed and configured (with s3cmd --configure). Then you can set the following script to run every 4 hours (a sample crontab entry is shown after the script). I’m not going to go into much detail on it; if you’re familiar with s3cmd and Amazon S3/Glacier, you can probably figure out how it works. I wrote it as a short-term fix, but it’s worth sharing.

#!/bin/bash

# This script should be fired every 4 hours from a cron job until all
# data from the desired bucket is restored.
# Requires s3cmd 1.5 or newer

# Temp file
TEMPFILE=~/.s3cmd.restore.tmp

# Bucket to restore data from. Use trailing slash.
BUCKET="s3://bucketname/"

# Folder to restore data to. Use trailing slash.
FOLDER="/destination_folder/"

# Because of the way s3cmd handles errors, we have to run the steps in a specific order:
# 1: download/delete already-restored files from the bucket
# 2: run restore on the remaining Glacier objects
# 3: do housekeeping on the downloaded data

if [ ! -f "$TEMPFILE" ]
then
    touch "$TEMPFILE"

    echo "=== Starting download phase"
    # Download (and then delete) everything that has already been restored;
    # still-frozen Glacier objects are skipped and may leave zero-byte files.
    s3cmd -r --delete-after-fetch --rexclude "/$" sync "$BUCKET" "$FOLDER"

    echo "=== Starting restore phase"
    # Initiate restores on the remaining Glacier objects, keeping the
    # temporary copies available for 30 days.
    s3cmd -r -D 30 restore "$BUCKET"

    echo "=== Starting cleanup"
    # s3cmd doesn't delete empty folders, and can create empty files. Clean this up.
    find "$FOLDER" -empty -delete
    # find might accidentally delete the target directory itself if nothing was
    # downloaded, so recreate it just in case.
    mkdir -p "$FOLDER"

    rm "$TEMPFILE"
fi
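
To run the script every 4 hours, a crontab entry along these lines should do it (the script and log paths are just placeholders for wherever you keep it):

# Fire the restore/download script every 4 hours and keep a log of each pass
0 */4 * * * /home/user/glacier-restore.sh >> /home/user/glacier-restore.log 2>&1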

Note that restore, download, and delete operations can incur extra costs. Be aware of that before proceeding.

So that’s it. I *should* have my entire S3 bucket downloaded within the next few days, and then I can migrate to what I hope is a simpler archiving plan.
