The situation

OK, they say knowledge is power. I never realized how true this phrase could be until last week!

One of my web applications deals with a huge number of documents, all stored on Amazon S3. We're talking about millions of files; it could literally take several hours just to list them.

Needless to say, this huge number of objects consumes an equally huge amount of space. Yes, you guessed right: we use terabytes to measure it. So, one day I thought: “Why don't we store the files that are very old on some less expensive storage tier?” A quick look through the Amazon S3 product list led me to Glacier. Glacier promises to keep your files at a much lower price tag than standard S3. I thought it was just a slower storage type: your files would be as available as ever, it would simply take a couple of seconds more to fetch them (so little did I know).

When hell broke loose

The procedure is quite easy. I am mentioning it here for your reference, but please DO NOT attempt to follow it until you have read the rest of this article.

  1. Go to your S3 dashboard. Just navigate to aws.amazon.com, choose S3 from the list of services, and click on the link that appears in the dropdown list.
  2. Choose the bucket that you want to work with, or enter its name to filter the list.
  3. Click on the “Management” tab.
  4. Make sure that “Lifecycle” is highlighted and click on “Add lifecycle rule”.
  5. When the wizard launches, choose a rule name and - optionally - a prefix to match only objects (aka files) that start with a specific string.
  6. In the transition step, choose whether the rule applies to the current version only or also to previous versions. Then click on the small link titled “+ Add transition”. Click next.
  7. In the “Object creation” dropdown list, choose “Transition to Amazon Glacier after” and enter the number of days past the creation date. For example, if you wanted to archive all objects older than one year to Glacier, type 365 in this box. Click next.
  8. In the Expiration tab, you can also configure the rule to delete files that are even older (like ten years) automatically.
  9. Click next to review and apply the settings.

There are also several ways to do this on the command line with the AWS CLI or Python boto, but that is out of scope here; still, a rough boto3 sketch of an equivalent rule follows for reference.
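The sketch below is only illustrative: the bucket name, rule ID, prefix, and day counts are placeholders I made up to mirror the console steps above, so adapt them before running anything.

import boto3

s3 = boto3.client('s3')

# Transition objects older than a year to Glacier and expire them
# after roughly ten years, similar to the console wizard above.
s3.put_bucket_lifecycle_configuration(
    Bucket='mybucket',                       # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-old-objects',     # placeholder rule name
            'Filter': {'Prefix': ''},        # empty prefix = the whole bucket
            'Status': 'Enabled',
            'Transitions': [{'Days': 365, 'StorageClass': 'GLACIER'}],
            'Expiration': {'Days': 3650},
        }]
    },
)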

Once you apply the rule, it doesn't take long before all the objects that match it are moved to Glacier. Nice, isn't it? Not in my case.

The next morning, I woke up to an overwhelming list of complaints. They came in all flavors, and through every means of communication that could possibly exist in the twenty-first century: e-mails, texts, messages, phone calls, you name it. A quick look at the log files before heading to work to handle the situation showed they were no less hostile.

Yes, as you might have guessed: thousands of error messages, log lines, warnings, etc., all revolving around one fact: file(s) not found. It was as if hell had broken loose.

Initial steps to handle the situation

In a matter of minutes, I realized the criticality of the situation I had put myself in. The files were no longer available to the web application; they could be restored on demand, but that costs money, and the cost depends on how soon you need them back.

Amazon has three methods of restoring files from Glacier: expedited, standard, and bulk. You can read more about the restore procedure here: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/restore-archived-objects.html and the pricing model here: https://aws.amazon.com/glacier/faqs/. However, it doesn't take a data scientist to realize that the slowest (yet cheapest) option is bulk.
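For the record, the retrieval tier is just a parameter on the restore request itself. A minimal boto3 sketch (the bucket name and key are placeholders) might look like this:

import boto3

s3 = boto3.client('s3')

# Ask S3 to pull one archived object back from Glacier for 30 days,
# using the cheapest (and slowest) Bulk tier. The other accepted
# tier values are 'Standard' and 'Expedited'.
s3.restore_object(
    Bucket='mybucket',               # placeholder bucket name
    Key='some/archived/object',      # placeholder key
    RestoreRequest={
        'Days': 30,
        'GlacierJobParameters': {'Tier': 'Bulk'},
    },
)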

So, now that we have agreed on the best option for restoring our glaciered files, what about the quantity? Obviously, this required a well-written shell script, and that's when my second problem arose.

Listing an insanely large number of objects is, well, insane!

The first thing I needed was to list the objects that got affected so that I could estimate the time and cost of retrieving them. I issued the following command:

aws s3 ls "s3://mybucket"

It took so long that I had to open another terminal session to monitor the machine's performance. I noticed that the amount of used RAM kept climbing until it eventually reached the maximum. That was when the command returned with a nasty error message. I don't blame it; listing millions of objects all at once may not be the best approach.

Don't get me wrong, the above command may work with a smaller number of objects and/or on a more powerful machine. I am just stating what happened.

The solution

I did a lot of searching and trial and error until I came up with the following steps, which finally worked.

Using pagination to make object-listing easier

Not only does pagination make listing easier, the following script will also give you a live view of the objects as they get retrieved, so you can filter them in near real time:

#!/usr/bin/python
import boto3

def iterate_bucket_items(bucket):
    # Yield every object in the bucket, one page (up to 1000 items) at a time.
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket=bucket)

    for page in page_iterator:
        # An empty bucket (or empty page) has no 'Contents' key.
        for item in page.get('Contents', []):
            yield item


for i in iterate_bucket_items(bucket='orderdatabucket'):
    print(i)

I ran this script as follows:

./list.py > objects.txt

A quick explanation

  • The script uses the Python boto3 library. You can install it using pip by running pip install boto3
  • It uses the S3 API operation list_objects_v2, which lists the objects in a specific bucket. However, as the documentation mentions, a single call is limited to the first 1000 records.
  • So, the script makes use of the paginator class, which is able to deliver the objects in pages of up to 1000 items each.
  • The rest is just a for loop that prints the results to standard output, from where you can redirect them to a text file.

Filtering out GLACIER objects

When the script starts running, and if you run tail -f against the output file, the objects will look something like this:

{u'LastModified': datetime.datetime(2013, 9, 29, 16, 52, 39, tzinfo=tzlocal()), u'ETag': '"xxxxxxxxxxxxxxxx"', u'StorageClass': 'GLACIER', u'Key': 'xxxxxxxxxxx', u'Size': 28015}

This is just one line of the output; in my case, I had millions of them. The file lists all the objects in the bucket: those in GLACIER and those that are still in the STANDARD storage class, so I needed to filter them. A simple grep command did the job:

grep GLACIER objects.txt > glacier_objects.txt

Extracting the file names (object keys)

Amazon S3 uses slightly different terminology when working with your data: a file becomes an object, and a filename becomes the object key. So, I needed a list containing only the object key names, one per line. The following Perl script was what came to my mind:

use strict;
use warnings;
use 5.012;

my $file = 'glacier_objects.txt';
open my $fh, '<', $file or die "Could not open '$file' $!\n";

while (my $line = <$fh>) {
   chomp $line;
   # Pull the object key out of the boto3 dictionary dump.
   my @strings = $line =~ /u'Key': '(.*?)'/;
   foreach my $s (@strings) {
     say $s;
   }
}

You may want to slightly tweak the regular expression that extracts the key, depending on whether your keys contain special characters or other requirements.

So, I ran the above script as follows:

perl clean.pl > just_objects.txt
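As an aside, the grep and Perl steps could probably be folded into the listing script itself by checking each item's StorageClass and writing only the key. A rough sketch of that idea, assuming the same placeholder bucket and the just_objects.txt file name used above:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# Write only the keys of objects that are currently in GLACIER,
# one key per line, straight into just_objects.txt.
with open('just_objects.txt', 'w') as out:
    for page in paginator.paginate(Bucket='mybucket'):   # placeholder bucket
        for item in page.get('Contents', []):
            if item.get('StorageClass') == 'GLACIER':
                out.write(item['Key'] + '\n')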

Reducing the restore time from 92+ hours to under 24

The file produced by the above Perl script was several million lines long. I needed to run the restore command (s3cmd restore --restore-days=30 --restore-priority=bulk "s3://mybucket/filename") on each key. This means traversing the file sequentially.

I ran a small test on a number of keys from the file to gauge the speed at which things would go. It was able to restore about 3 objects per second. Using second-grade math, I realized that it would take about 92 hours (about 4 days) to restore just 1 million files.

The solution I came up with was to launch a relatively powerful AWS EC2 instance (I used one with 8 cores and 16 GB of memory), split the large file into several smaller files of about 100,000 records each, and launch the operation on all of them simultaneously. Now, that was a really long sentence!

Splitting the file was pretty easy, as the split command is available in almost all Linux distributions:

split -l 100000 just_objects.txt

It will produce a number of files named sequentially like xaa, xab, xac and so on.

Now, using an extremely useful tool called byobu (http://byobu.co/), I was able to launch a separate terminal* for each file and issue the following command:

while read line; do
    s3cmd restore --restore-days=30 --restore-priority=bulk "s3://mybucket/${line}"
done < xaa

The script is pretty simple; it's just a while loop that reads the file line by line. Each line contains just the object key, so the restore command is issued against it. Of course, the restore priority was bulk. Notice that I set the validity period of the restored files to 30 days so that I would have enough time to move them completely out of Glacier and back to the standard storage class. More on that later.

And for about 10 hours, the eight cores of the instance where I ran the scripts hardly ever left the 100% mark! But in the end, all the objects were restored successfully.
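For completeness, if you would rather stay in Python, roughly the same fan-out can be sketched with boto3 and a thread pool instead of separate byobu terminals. This is only a sketch under the same assumptions as above (Python 3, a 30-day window, the Bulk tier, a placeholder bucket name, and keys sitting in xaa):

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client('s3')

def restore(key):
    # Same request the s3cmd loop issues: 30-day window, Bulk tier.
    s3.restore_object(
        Bucket='mybucket',                    # placeholder bucket name
        Key=key,
        RestoreRequest={'Days': 30,
                        'GlacierJobParameters': {'Tier': 'Bulk'}},
    )

with open('xaa') as f:
    keys = [line.strip() for line in f if line.strip()]

# Fan the restore requests out over a handful of worker threads.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(restore, keys))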

The final step

As mentioned before, restoring the objects only makes them available for the specified period of time (in my case, I set it to 30 days). To get your objects back to the standard S3 storage class for good, an extra step has to be performed.

You need to copy each object, setting the source and destination paths to be the same and only changing the storage class in the process. The following script does exactly this:

while read line; do
    aws s3 cp "s3://mybucket/${line}" "s3://mybucket/${line}"  --storage-class=STANDARD --force-glacier-transfer
done < xaa

I had to run this script in all the byobu terminals that hosted the restore script. The cp command may - naturally - take longer to process. However, thanks to splitting the files and launching the scripts in parallel, the whole process took about 17 hours.
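For reference, the same copy-in-place trick maps onto boto3's copy_object call, with the source and destination set to the same bucket and key and the storage class switched to STANDARD. A rough sketch with the usual placeholder names:

import boto3

s3 = boto3.client('s3')

# Copy each restored object onto itself, switching it back to STANDARD.
# The object must still be within its restore window for this to work,
# and objects larger than 5 GB would need a multipart copy instead.
with open('xaa') as f:
    for line in f:
        key = line.strip()
        if not key:
            continue
        s3.copy_object(
            Bucket='mybucket',                              # placeholder
            Key=key,
            CopySource={'Bucket': 'mybucket', 'Key': key},
            StorageClass='STANDARD',
        )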

I listed the objects one last time using the boto3 script mentioned above to make sure that all the files had been restored and were back in the standard storage class.
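If you still have the iterate_bucket_items() generator from the listing script handy, that final check boils down to counting anything still reported as GLACIER; zero means everything made it back. A tiny sketch, again with a placeholder bucket name:

# Count how many objects are still reported with the GLACIER storage class.
remaining = sum(
    1 for item in iterate_bucket_items(bucket='mybucket')   # placeholder
    if item.get('StorageClass') == 'GLACIER'
)
print('objects still in GLACIER: %d' % remaining)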

This might not be the best way, but it worked!

This article listed the steps I followed to restore millions of objects that got accidentally buried in Amazon Glacier. There's always more than one way to get things done, and this is no exception. I was in a hurry and eager to get everything back as soon as possible, so I followed what seemed reasonable to me. This might not be the fastest, most efficient, or lowest-cost solution to similar problems, but it worked for me and I'm glad it did.

  • The reason I used byobu is that running the tool with nohup never worked. I never tried to investigate the reason because I was eager to get things done in the shortest possible time.