AWS S3 Glacier explained

In some cases, files need to be stored and kept for years for audit or compliance purposes, and they are rarely accessed. S3 Glacier is a cheap and flexible solution for archiving data. What exactly is S3 Glacier, and how does it work?

Last week I wrote about S3, which is AWS’s object storage service. Large amounts of data can be easily uploaded to S3 and accessed in a matter of milliseconds.

But in some cases, data don’t have to be accessed frequently at all. For example, when a company needs to keep documents for audit and compliance purposes, it only needs to store those files somewhere for the required amount of time, like 5 years. These data might not be accessed every day, if ever.

AWS has a solution for this type of storage, which is called S3 Glacier.

1. What is S3 Glacier?

In the post on S3 referred to above, I mentioned that S3 has multiple storage classes.

S3 Glacier is a special storage class of S3, and it’s suitable for secure, durable and long-term data storage for infrequent access at a low cost.

S3 Glacier is secure because data are encrypted on the server side by default, and the transfer to S3 Glacier supports encryption, too. By default, only the person who created the vault (the storage container for data archives) can access data. If necessary, other users in the account or other accounts can be granted access to the stored files through user-based and resource-based policies.

S3 Glacier is durable, because data are stored across multiple (at least three) locations called Availability Zones (AZs). For more information on AZs, see this earlier post and the AWS documentation.

S3 Glacier is for long-term data storage. A minimum storage duration of 90 days applies to archives, and if a stored file is deleted before 90 days have passed, an early deletion fee will apply.

S3 Glacier is a low-cost object storage class for archiving large amounts of data. The storage cost is usually less than 1 cent per gigabyte per month, depending on the region.

2. Main features

As mentioned above, S3 Glacier is a highly available and durable archive storage service.

2.1. Storage

Data are stored as archives in S3 Glacier. The most commonly used archive formats are TAR and ZIP. It’s possible to archive only one file, but it’s often more cost-effective to bundle multiple files together in one archive.

There’s no upper limit to the number of archives that can be stored in S3 Glacier. The size of one archive can range from 1 byte to 40 terabytes, and a maximum of 4 gigabytes can be uploaded in a single request.

If the size of the archive is greater than 4 gigabytes, the multipart upload feature needs to be used. Multipart upload splits the archive into several parts that can be uploaded in parallel, making the upload faster. Once the upload has finished, the archive is reassembled from the parts.

Once an archive is uploaded, it cannot be changed anymore. If, for any reason, a change to the saved archive is necessary, the archive needs to be deleted first, and then the modified archive needs to be uploaded.

Archives are stored in containers called vaults, a concept similar to S3 buckets. Vaults can easily be created with a few button clicks in the AWS console.
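To make the multipart idea more concrete, here is a minimal sketch of how an archive could be split into parts. It relies on two rules that are not stated in this post but that, as far as I know, apply to the Glacier multipart API: the part size is 1 MiB multiplied by a power of two (up to 4 GiB), and an upload can have at most 10,000 parts. The function names are my own, and I treat "4 gigabytes" as 4 GiB here.

```python
MIB = 1024 * 1024
MAX_PART_SIZE = 4 * 1024 * MIB  # 4 GiB, the largest part (assumed binary units)
MAX_PARTS = 10_000              # assumed limit on parts per multipart upload


def choose_part_size(archive_size: int) -> int:
    """Return the smallest allowed part size (in bytes) for the archive."""
    if archive_size < 1:
        raise ValueError("an archive must contain at least 1 byte")
    part_size = MIB
    # Double the part size until the archive fits into MAX_PARTS parts.
    while part_size * MAX_PARTS < archive_size:
        part_size *= 2
    if part_size > MAX_PART_SIZE:
        raise ValueError("archive is too large for a multipart upload")
    return part_size


def part_ranges(archive_size: int, part_size: int):
    """Yield inclusive (start, end) byte ranges, one per part."""
    for start in range(0, archive_size, part_size):
        yield start, min(start + part_size, archive_size) - 1


size = 100 * 1024 * MIB                # a 100 GiB archive
part = choose_part_size(size)
print(part // MIB)                     # part size in MiB
print(len(list(part_ranges(size, part))))  # number of parts
```

Each range would then be sent as one part, and only the last part is allowed to be smaller than the chosen part size.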

2.2. Durability and availability

S3 Glacier, similarly to most S3 storage classes, offers high durability of eleven 9’s (99.999999999%) per year. This means that the probability of losing stored data is extremely low; it’s practically zero. S3 Glacier will only return a SUCCESS response after upload if the data have been saved to all locations.

The S3 Glacier service is designed for 99.99% availability per year after data are restored (retrieved).
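To get a feeling for what eleven 9’s means, here is a small back-of-the-envelope calculation. It makes a simplifying assumption of my own: the durability figure is read as an independent yearly loss probability for each archive. Exact fractions are used to avoid floating-point noise with such tiny numbers.

```python
from fractions import Fraction

# 1 - 0.99999999999, read as a per-archive, per-year loss probability.
# This interpretation is a simplification for illustration only.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(archive_count: int) -> Fraction:
    """Expected number of archives lost per year."""
    return archive_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(archive_count: int) -> Fraction:
    """Average number of years before one archive is expected to be lost."""
    return 1 / expected_annual_losses(archive_count)


print(expected_annual_losses(10_000_000))   # 1/10000
print(years_per_expected_loss(10_000_000))  # 10000
```

In other words, under this reading, someone storing ten million archives could expect to lose a single one about once every 10,000 years.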

2.3. Security

Data stored in S3 Glacier are encrypted by default, and this is called encryption at rest. S3 Glacier encrypts the data with AES-256, and manages the encryption keys and their protection. If the data uploaded to S3 Glacier are already encrypted, they will be encrypted again by S3 Glacier.

It’s also possible to encrypt data in transit, i.e. while the data are uploaded to S3 Glacier.

Access to vaults and archives is controlled through policies, which can be user-based or resource-based.

User-based policies are based on the Identity and Access Management (IAM) service, and control which permissions (read, write, delete) users or user groups have on the vaults.

Resource-based policies are attached to the resource (in this case, the resource is the vault itself), and they control access for all users. For example, a vault policy can allow read operations or deny delete operations on a vault for all users. This is much more convenient than going to the IAM policy of each user, and setting up the permissions one by one.
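As an illustration of a resource-based policy, the sketch below builds a vault policy document that denies archive deletion for everyone. The region, account ID and vault name are made-up placeholders, and the helper function is my own. The action name and the ARN layout follow the Glacier conventions as I know them, so double-check them against the AWS documentation before using anything like this.

```python
import json


def deny_delete_policy(region: str, account_id: str, vault_name: str) -> dict:
    """Build a vault policy that denies archive deletion to all principals."""
    # Assumed ARN layout for a Glacier vault.
    vault_arn = f"arn:aws:glacier:{region}:{account_id}:vaults/{vault_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "deny-archive-deletion",
                "Effect": "Deny",
                "Principal": "*",  # applies to every user
                "Action": ["glacier:DeleteArchive"],
                "Resource": vault_arn,
            }
        ],
    }


# Placeholder values, for illustration only.
policy = deny_delete_policy("eu-west-1", "123456789012", "audit-archive")
print(json.dumps(policy, indent=2))
```

The resulting JSON document is what would be attached to the vault, so one statement covers every user instead of one IAM change per user.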

3. Retrieval

In some cases, even archived data have to be accessed, and to do this, a retrieval request needs to be initiated. Once the retrieval is complete, data can be downloaded from S3.
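The request-wait-download flow can be sketched roughly as follows. The function expects a Glacier client object such as the one returned by boto3.client("glacier"). The call names and response keys reflect the boto3 Glacier client as I understand it, so treat them as assumptions to verify. The function name, the 15-minute polling interval and the in-memory read are my own choices, suitable only for small archives.

```python
import time


def retrieve_archive(client, vault_name, archive_id,
                     tier="Standard", poll_seconds=900):
    """Request an archive, wait until it is ready, and return its bytes."""
    # 1. Initiate the retrieval request (a "job").
    job = client.initiate_job(
        accountId="-",  # "-" means the account of the current credentials
        vaultName=vault_name,
        jobParameters={
            "Type": "archive-retrieval",
            "ArchiveId": archive_id,
            "Tier": tier,
        },
    )
    job_id = job["jobId"]

    # 2. Poll until the retrieval is complete.
    while True:
        status = client.describe_job(
            accountId="-", vaultName=vault_name, jobId=job_id
        )
        if status["Completed"]:
            break
        time.sleep(poll_seconds)

    # 3. Download the retrieved data.
    output = client.get_job_output(
        accountId="-", vaultName=vault_name, jobId=job_id
    )
    return output["body"].read()
```

In practice, a notification on job completion would be a nicer alternative to polling, but the three steps stay the same.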

S3 Glacier offers three retrieval options based on how soon the data are available after the request.

Expedited retrieval is the fastest and the most expensive option. With this option, data will be available for download within minutes. It is only recommended when a small number of archives is urgently needed, and it should only be used occasionally. This option depends on available capacity at AWS, which can be guaranteed by buying provisioned capacity, and that costs a lot of money compared to the cost of S3 Glacier storage.

The recommended retrieval method is Standard, with which archives will be available within 3-5 hours. This retrieval option can be used when backup data are needed or the data will be used in a planned event.

The third option is Bulk retrieval, and this option is the cheapest of all. As its name suggests, Bulk retrieval allows the user to retrieve large amounts of data within 5-12 hours. This option is definitely not recommended for urgent cases.
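The choice between the three options boils down to picking the cheapest one that still meets the deadline. Here is a minimal sketch of that decision, using the upper bounds quoted above. The 5-minute figure for Expedited is my own assumption for "within minutes", and the function name is made up.

```python
# (tier name, worst-case hours until the data are available),
# ordered from cheapest to most expensive.
TIERS = [
    ("Bulk", 12.0),
    ("Standard", 5.0),
    ("Expedited", 5 / 60),  # assumption: "within minutes" means up to 5 minutes
]


def cheapest_tier(deadline_hours: float) -> str:
    """Return the cheapest tier whose worst case still meets the deadline."""
    for name, worst_case_hours in TIERS:
        if worst_case_hours <= deadline_hours:
            return name
    raise ValueError("no retrieval tier is fast enough for this deadline")


print(cheapest_tier(24))  # Bulk
print(cheapest_tier(6))   # Standard
print(cheapest_tier(1))   # Expedited
```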

4. Deep Archive

AWS will soon offer a new S3 storage class called Deep Archive, which will be the cheapest archiving solution.

In many cases, data don’t have to be accessed at all, and S3 Glacier Deep Archive will offer a cheap option to store these data in a 21st century way.

All data stored in S3 Glacier Deep Archive will be available for download within 12 hours.

5. How does pricing work?

Pricing for S3 Glacier is fairly complex, because the final amount consists of multiple parts.

It’s no surprise, and it has already been mentioned, that we have to pay for storage. Storage can cost less than 1 cent a month per gigabyte, and because S3 Glacier is designed for infrequent data access, this part will usually have the largest share in the monthly fee.

As discussed above, data can be retrieved from S3 Glacier, and this also costs money. AWS charges for both the retrieval and the retrieval request. Retrievals are billed per gigabyte, and requests are priced per 1,000 requests, pro-rated for smaller numbers. The good news is that AWS offers 10 GB of free retrieval every month.

Upload requests and data transfer are also charged, except for data transfer between an EC2 instance and S3 Glacier within the same region.

As already mentioned, if an archive is deleted within 90 days, an early deletion fee will apply.
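To show how these parts add up, here is a rough cost estimator. All prices in it are hypothetical placeholders, not current AWS prices, and the function names are mine. The early deletion fee is modelled as the storage cost of the remaining days up to day 90 with a 30-day month, which is my understanding of how the fee is pro-rated, so treat that as an assumption as well. Upload request and data transfer charges are left out to keep the sketch short.

```python
from decimal import Decimal

# Hypothetical prices in USD, for illustration only.
STORAGE_PER_GB_MONTH = Decimal("0.004")
RETRIEVAL_PER_GB = Decimal("0.01")
REQUEST_PER_1000 = Decimal("0.05")

FREE_RETRIEVAL_GB = Decimal("10")  # free monthly retrieval mentioned above
MIN_STORAGE_DAYS = 90


def monthly_cost(stored_gb, retrieved_gb=0, retrieval_requests=0):
    """Estimate one month's bill: storage + retrieval + retrieval requests."""
    storage = Decimal(stored_gb) * STORAGE_PER_GB_MONTH
    billable_gb = max(Decimal(retrieved_gb) - FREE_RETRIEVAL_GB, Decimal(0))
    retrieval = billable_gb * RETRIEVAL_PER_GB
    # Requests are pro-rated, so 500 requests cost half the per-1000 price.
    requests = Decimal(retrieval_requests) / 1000 * REQUEST_PER_1000
    return storage + retrieval + requests


def early_deletion_fee(size_gb, days_stored):
    """Assumed model: pay storage for the days remaining until day 90."""
    remaining_days = max(MIN_STORAGE_DAYS - days_stored, 0)
    return Decimal(size_gb) * STORAGE_PER_GB_MONTH * remaining_days / 30


print(monthly_cost(1000))                # storage only
print(monthly_cost(1000, 60, 500))       # with some retrievals
print(early_deletion_fee(100, 30))       # deleted after 30 days
```

With these placeholder prices, storing 1,000 GB dominates the bill even in a month with 60 GB of retrievals, which matches the point that storage usually has the largest share.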

6. Conclusion

S3 Glacier is a cheap archive storage class, and it is recommended for storing large amounts of data that are accessed infrequently, if ever.

It’s a durable and highly available solution, and, similarly to other S3 storage classes, the archives are stored in multiple locations. S3 Glacier is also secure, with data being encrypted on the server side by default.

If needed, data can be retrieved, and AWS offers several retrieval options. It’s good to know, though, that charges apply for both the amount of data retrieved and the retrieval request itself. If S3 Glacier is used as intended, i.e. for just storing archives, the cost will mainly consist of storage.

Thanks for reading and see you next time.