How difficult is it to run a public cloud service?
As we all know, Amazon Web Services (AWS) experienced an outage on 21-Apr-2011 that lasted almost four days. Quite a lot of companies were affected, and you can find the list here. The Internet was flooded with articles speculating about what went wrong, whether cloud computing is viable in the long run, how Amazon's services did not function as advertised, how applications should be built, and so on. While most offered their opinions in broad strokes such as "use multiple regions/clouds", "use built-in redundancy", and "don't use public clouds", two stood out, in my opinion, in offering very valuable advice grounded in first-hand experience: George Reese's and RightScale's. Amazon eventually published a detailed post-mortem of the incident on 29-Apr-2011. It is very long, and Amazon was very forthcoming with the details. To really understand the problem, one needs to understand how the affected service, EBS, is implemented. Netflix's Adrian Cockcroft wrote a wonderful article, but here is my take.
Amazon Web Services (the cloud) consists of multiple regions. Each region consists of multiple availability zones (AZs). Amazon says that each AZ is completely independent. Among other services, EC2 and EBS have affinity to an AZ. EBS offers block-level storage to the cloud's virtual machines. Each unit, called a data volume, has a fully redundant replica, and the two are frequently synced to account for failures: if at any time a data volume detects that its replica's health is bad, it creates a new one and syncs up all the data immediately; make a note of this point! While the data volumes themselves are fully isolated within an availability zone, there is an entity called the EBS control plane that is common to all the AZs in a region.
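To make the volume/replica relationship concrete, here is a toy sketch in Python. All the class and method names are mine, not AWS's; this is only a caricature of the behavior described above, where each volume aggressively re-mirrors through a shared control plane the moment its replica looks unhealthy.

```python
class ControlPlane:
    """Simplified stand-in for the region-wide EBS control plane,
    shared by every AZ (and thus every volume) in the region."""
    def __init__(self):
        self.requests = 0  # how much work volumes have pushed onto us

    def provision_replica(self):
        # Every re-mirror request consumes shared control-plane capacity.
        self.requests += 1
        return Replica(healthy=True)

class Replica:
    def __init__(self, healthy=True):
        self.healthy = healthy

class Volume:
    """A data volume that keeps exactly one replica and immediately
    re-mirrors when the replica appears dead."""
    def __init__(self, control_plane):
        self.cp = control_plane
        self.replica = self.cp.provision_replica()

    def health_check(self):
        # The "aggressive" behavior: no hesitation, no backoff.
        if not self.replica.healthy:
            self.replica = self.cp.provision_replica()

cp = ControlPlane()
vol = Volume(cp)
vol.replica.healthy = False   # simulate losing contact during a network partition
vol.health_check()            # volume instantly asks for a fresh replica
```

Note that the volume has no idea how busy the control plane is or whether other volumes are doing the same thing; that blindness is exactly what matters in the incident described next.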
Now to the problem. As part of scaling their services, AWS re-directed the primary EBS traffic to a redundant router; that is what triggered the avalanche. Rather than re-directing to the full-capacity primary backup network, they accidentally shifted the traffic to a lower-capacity secondary network. As a result, both networks were suddenly affected: the primary traffic flooded the secondary network, which did not have the resources to handle the load of both. That was the first problem. When Amazon fixed it, the re-mirroring storm happened. Remember how each EBS volume has a replica and constantly monitors its health? When the network disruption happened, the EBS volumes could not contact their replicas and assumed they were dead. They aggressively took steps to create new replicas, and this overloaded the EBS control plane. Remember again that the control plane is shared across the AZs in a region. Since it was overwhelmed with re-mirroring requests, it could not service the normal requests of the unaffected AZs. So, although I am not sure it is entirely appropriate to say this, Amazon's "aggressive" algorithms to prevent customer data loss worked against them in this case. One issue we can quickly see here is that each EBS volume does not have global visibility! Had a volume known that other volumes were affected too, it could have been made to back off when many were trying to do the same thing. Because they were acting as silos, each acted in earnest and together they essentially caused an internal denial of service. For an exact blow-by-blow account, please read Amazon's post-mortem article.
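The "back off" idea mentioned above is a standard remedy for this kind of retry stampede: randomized exponential backoff. The sketch below is my own illustration, not anything from Amazon's post-mortem; it computes "full jitter" retry delays so that thousands of volumes that lose their replicas at the same instant do not re-mirror in lockstep against the shared control plane.

```python
import random

def backoff_delays(max_attempts, base=1.0, cap=60.0, rng=random.random):
    """Randomized exponential backoff ("full jitter") delays, in seconds.

    Each attempt waits a random time in [0, min(cap, base * 2**attempt)),
    spreading simultaneous retries out over a growing window instead of
    hammering a shared service in synchronized waves."""
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays

# Example: seven retry delays, each drawn from a doubling window capped at 60 s.
print(backoff_delays(7))
```

The key design point is the jitter: a plain doubling schedule still synchronizes all the volumes that failed at the same moment, while the random draw decorrelates them.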
Amazon said that it is treating this as an opportunity to improve its infrastructure, the details of which can be found in its article. In addition, Amazon has created an architecture page for all AWS developers, with plenty of resources for designing and planning cloud-scale applications. It is also holding a series of webinars to help customers better understand fault-tolerant design. More details at http://aws.amazon.com/architecture/.
Over the weekend, I attended the "cloud meetup" organized by the Silicon Valley Cloud Computing Group at the Microsoft campus in Mountain View. One of the sessions I attended was a panel discussion, "Surviving Public Clouds Outage," with Netflix's Adrian Cockcroft and Nimbula's Chris Pinkham. They essentially reiterated a few things (not verbatim):
- Amazon’s system is one of the best-designed and best-architected out there. Problems like these happen, and it is important to see how Amazon is handling it and making things better for its customers.
- Application developers should really understand which services best suit their needs.
- Multi-region redundancy should also be considered, not just multi-AZ redundancy.
- Hot standby (full redundancy) comes with a cost. One needs to understand how and when the various Amazon instance types (On-Demand, Reserved, and Spot) should be used.
- Amazon’s EBS system currently has 500 million volumes. Testing for a whole-AZ failure or a whole-region failure is never realistic. The better we understand it, the better we can design.
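The multi-region point above can be reduced to a very small pattern at the application layer: try the primary region, and fail over to a secondary on error. The sketch below is purely illustrative; the region names and the `fetch` callable are my own placeholders, not any AWS API.

```python
def fetch_with_failover(fetch, regions):
    """Try each region's endpoint in order and return the first success.

    `fetch` is any callable taking a region name and raising on failure;
    `regions` is an ordered preference list, primary region first."""
    last_err = None
    for region in regions:
        try:
            return fetch(region)
        except Exception as err:
            last_err = err  # remember the failure, try the next region
    raise RuntimeError("all regions failed") from last_err

# Hypothetical usage: the primary region is down, the secondary serves.
def flaky(region):
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"

print(fetch_with_failover(flaky, ["us-east-1", "us-west-1"]))
# prints "served from us-west-1"
```

Real deployments usually put this decision in DNS or a load balancer rather than application code, but the ordering-and-fallback logic is the same.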
Netflix also published a blog post on the lessons it learned from the AWS outage. They use EBS very sparingly, had built in multi-region redundancy, and use Reserved instances to optimize cost. SmugMug, the photo-sharing site, which also survived this Amazon outage, published a blog post of its own. Both are excellent, and, as someone tweeted, it is always better to learn from the people who have tried, failed, and learned than from an armchair quarterback!
I want to conclude with this tweet from George Reese: “Reality is: AWS outage = plane crash; IT outage = car crash. And car accidents are a much bigger problem.” After all, flying is still the safest form of travel! Amazon has shown that it is doing its best to make us feel comfortable, particularly after this incident. This should not only increase our confidence in the cloud but also make us wary of the issues involved and better prepared to handle them.
[Ed. note: Trend Micro would like to know what you think about this. We enthusiastically invite your comments and we will read every one of them. For very detailed information: