An explanation of Amazon’s recent S3 outage
Amazon Web Services (AWS) recently experienced an outage in its Simple Storage Service (S3) offering. The outage lasted only a few hours, but it had massive ramifications for dozens of sites hosted in the US-EAST-1 (N. Virginia) region.
The company has explained that its S3 team was debugging an issue that was causing S3’s billing system to run slowly. At 9:37 AM PST (4:37 AM Wednesday AEDT), one of the technicians ran a command intended to remove a small number of servers supporting the S3 billing process. However, the technician entered one of the command’s inputs incorrectly, and a larger set of servers was removed than intended. That set included servers required by the index subsystem, which manages the metadata and location information of all S3 objects in the US-EAST-1 (N. Virginia) region. As a result, S3 could not process requests in the region.
The removed servers also supported a placement subsystem, which is used during PUT requests to allocate storage for new objects added to S3. With a significant portion of their capacity gone, both the indexing and placement subsystems required a full restart. While they restarted, S3 was unable to service requests.
Other systems in the region that relied on S3, including the console, service health dashboard, Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Lambda, were impacted while the S3 API was unavailable.
Because the service health dashboard was itself caught up in the cascading failure, Amazon could only report the outage via Twitter and a website banner.
This outage is a reminder that the cloud is vulnerable to these sorts of events and human error. It’s best to ensure that you have a recovery option or fail-safe for your third-party services.
There are several ways to guard your product against an outage of this nature. You can use multi-region high availability and replicate your file storage across regions, which can be done within S3 itself. Alternatively, you can aim for high availability across providers by replicating your S3 buckets to a competing service such as Google Cloud Platform or Microsoft Azure. Either way, you can switch traffic to the replica when the primary is unavailable.
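As a rough illustration of the within-S3 option, the sketch below builds a cross-region replication configuration and applies it with boto3's put_bucket_replication call. The bucket names, account ID and IAM role are hypothetical placeholders, and the sketch assumes both buckets already exist with versioning enabled, which S3 replication requires.

```python
import json


def build_replication_config(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a replication configuration that copies all new objects to another bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: the rule applies to every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


def apply_replication(source_bucket: str, config: dict) -> None:
    """Apply the configuration (needs boto3, AWS credentials and versioned buckets)."""
    import boto3  # imported lazily so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)


if __name__ == "__main__":
    # Hypothetical role and destination bucket in a second region.
    config = build_replication_config(
        role_arn="arn:aws:iam::123456789012:role/example-replication-role",
        destination_bucket_arn="arn:aws:s3:::example-bucket-us-west-2",
    )
    print(json.dumps(config, indent=2))
```

Replication only copies objects written after the rule is enabled, so an existing bucket would also need a one-off copy of its current contents.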
This is where building your applications according to the twelve-factor app methodology pays off. Because configuration such as storage endpoints lives in environment variables rather than in code, a simple swap of those variables, or an automatic failover, could point upload and download requests to an alternate endpoint.
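A minimal sketch of that idea, assuming two hypothetical environment variables (STORAGE_ENDPOINT_PRIMARY and STORAGE_ENDPOINT_FALLBACK) and a caller-supplied operation that takes an endpoint URL. It is one possible shape for endpoint failover, not a prescribed implementation.

```python
import os
from typing import Callable, List, Mapping, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical variable names; use whatever your deployment defines.
ENDPOINT_VARS = ("STORAGE_ENDPOINT_PRIMARY", "STORAGE_ENDPOINT_FALLBACK")


def storage_endpoints(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the configured endpoints in priority order, primary first."""
    endpoints = [env[name] for name in ENDPOINT_VARS if env.get(name)]
    if not endpoints:
        raise RuntimeError("no storage endpoint configured")
    return endpoints


def with_failover(endpoints: Sequence[str], operation: Callable[[str], T]) -> T:
    """Try the operation against each endpoint in turn until one succeeds."""
    last_error = None
    for endpoint in endpoints:
        try:
            return operation(endpoint)
        except OSError as exc:  # covers connection errors and timeouts
            last_error = exc
    raise RuntimeError("all storage endpoints failed") from last_error


if __name__ == "__main__":
    env = {
        "STORAGE_ENDPOINT_PRIMARY": "https://s3.us-east-1.amazonaws.com",
        "STORAGE_ENDPOINT_FALLBACK": "https://s3.us-west-2.amazonaws.com",
    }

    def fake_download(endpoint: str) -> str:
        # Simulate the primary region being down.
        if "us-east-1" in endpoint:
            raise ConnectionError("simulated outage")
        return "downloaded from " + endpoint

    print(with_failover(storage_endpoints(env), fake_download))
```

Failover of this kind only helps if the fallback endpoint already holds a copy of the data, so it pairs with whichever replication approach you choose.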
Email [email protected] if you have any questions.