What happens when monstrous storm clouds like those in Northern Virginia last week battle a mammoth cloud computing infrastructure like Amazon Web Services?
The climactic, “climatic” outcome of this heavy weather was a power disruption that swept popular sites like Instagram, Netflix and Pinterest off the Web grid. Heroku, a platform provider to many startups and mobile apps, was also brought down by the outage.
Amazon had previously suffered downtime in the same Northern Virginia facilities in April 2011.
The untimely collision of external and internal events reignites the discussion of cloud computing vulnerability and configurability, especially the “single host unit-of-failure” scenario.
Had these sites maintained instances in multiple locations, the trouble might have been averted. The same issues affect other everyday cloud computing services, even ones as simple as basic email.
Part of the appeal of cloud computing is that it outsources the burden of holding your own data in house, so that, in theory, you no longer have to worry about failure. But extreme weather, power sags and surges, natural disasters, fire and human error all show that only distributed, multi-location services can provide a truly reliable level of uptime for mission-critical applications.
The question therefore emerges: should all Web organizations hosting major e-commerce or other high-performance applications maintain geographic redundancy by default? Like a RAID array that spreads data across multiple disks, or the “data diaspora” of BitTorrent, shouldn’t Internet hosting services be “arrayed,” or inter-networked, by definition?
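To make the idea of geographic redundancy a little more concrete, here is a minimal Python sketch of client-side failover, assuming a service that is deployed identically in two locations. The region names (us-east-1 is AWS’s Northern Virginia region, us-west-2 its Oregon region) are used purely for illustration, and the handler functions are hypothetical stand-ins for real network calls. In production, this kind of failover is more often handled at the DNS or load-balancer layer than in application code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class RegionEndpoint:
    """One (hypothetical) copy of the same service, hosted in a single location."""
    name: str
    handler: Callable[[str], str]  # stand-in for a real network call; raises ConnectionError on outage


class AllRegionsDown(Exception):
    """Raised only when every location has failed."""


def serve_with_failover(endpoints: Sequence[RegionEndpoint], request: str) -> Tuple[str, str]:
    """Try each location in priority order; return (region, response) from the first that answers."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return endpoint.name, endpoint.handler(request)
        except ConnectionError as exc:
            # This location is down: record why and fall through to the next one.
            errors.append(f"{endpoint.name}: {exc}")
    raise AllRegionsDown("; ".join(errors))


# Hypothetical handlers simulating a storm-hit data center and a healthy one.
def power_failure(_request: str) -> str:
    raise ConnectionError("power failure")


def healthy(request: str) -> str:
    return f"ok: {request}"


regions = [
    RegionEndpoint("us-east-1", power_failure),  # illustrative: Northern Virginia
    RegionEndpoint("us-west-2", healthy),        # illustrative: Oregon
]

print(serve_with_failover(regions, "GET /feed"))  # ('us-west-2', 'ok: GET /feed')
```

The point of the design is that the service as a whole fails only if every location fails at the same time; a single storm-hit facility merely shifts traffic elsewhere. The sketch deliberately leaves out the harder problem of keeping data replicated across those locations.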
On April 21, 2011, AWS experienced its first major outage, which took down Reddit.com among others. The outage was traced to a network upgrade: traffic was supposed to be shifted off one of the redundant routers in the primary Amazon Elastic Block Store (EBS) network, but the shift was executed incorrectly. Netflix, as it happened, had been designed for exactly this type of failure. Its servers did not use EBS as their main data storage, so when EBS ran into trouble, Netflix stayed up.
Seattle native Aaron Levie, CEO of cloud services company Box, notes that the simplicity of Amazon’s infrastructure-as-a-service model is compelling but also takes a lot of control out of a company’s hands:
“At the end of the day, the cloud’s availability will come down to its physical infrastructure being available — it looks like Amazon’s data center in Virginia experienced a power failure, which knocked out a number of its systems there. For the applications built on top of Amazon, sometimes negative consequences from these events can cascade through your infrastructure (e.g. when one service goes down, it then overloads another service that was otherwise fine), and in other cases some apps just don’t have resilience for these events built into their software,” said Levie.
“AWS doesn’t necessarily promise to handle these situations gracefully for you; because it’s a provider of infrastructure as a service, you get pretty low-level access to the technology (vs. making it super abstracted). That comes with huge benefits, but equally has consequences if the infrastructure disappears. That said, AWS has a pretty great track-record for uptime, but of course given their popularity, when they hit a snag the entire internet notices. At Box, we don’t use AWS for any primary infrastructure, and we run out of a number of our own datacenters to ensure fault tolerance in the event of a physical system experiencing issues, so that helps.”
TechCrunch has reported a link, provided by Barry Nolan, CEO and co-founder of in-app messaging specialist Converser, to a post on the Twilio engineering blog explaining why Twilio, which also runs on AWS, was not affected by a previous outage.
In 2004, an engineer at Amazon presented a paper proposing how the company could turn a profit on the infrastructure required to run the Amazon.com store. Amazon EC2 was subsequently built by a team in Cape Town, South Africa. By June 2007, Amazon could already claim that more than 330,000 developers had signed up to use Amazon Web Services. [24×7]