After years of hype, the IT industry finally had a rude awakening this spring, reminding us that cloud computing infrastructures are vulnerable to the same genetic IT flaw that plagues traditional data center operations: Everything fails sooner or later.

In March, a magnitude 9.0 earthquake and the subsequent tsunami caused widespread disruptions to power supplies and network connectivity at data centers across Japan, prompting Japanese companies to rethink their traditional disaster recovery strategies. Several weeks later, the Elastic Block Store (EBS) system in one of Amazon's EC2 data centers in the Eastern U.S. failed due to a faulty router upgrade, and the resulting cascade of events sent hundreds of customers (including many Web 2.0 companies such as Foursquare and Reddit) scrambling to resume services.

Ironically, these events also highlight how cloud infrastructures, when managed correctly, actually provide unprecedented capabilities to deliver high availability, resiliency and business continuity in IT operations.

Planning for Failure in the Cloud

Protecting your organization from unplanned downtime depends largely on building redundancy and diversity directly into your disaster recovery and business continuity systems. Business systems need to be able to run on a number of different infrastructures, whether public clouds such as Amazon or Rackspace or private clouds built on traditional on-premises hardware, and to fail over between them quickly and efficiently as necessary.

Despite the Amazon outage, public clouds now provide organizations with an impressively wide array of options to implement business continuity at a level of affordability that simply did not exist a few years ago. Consider this: Right now, from my laptop, I can launch servers in a dozen disparate locations worldwide – including the U.S., Europe, and Asia – for pennies per hour. As a result, I can design a system for my business that can reasonably withstand localized outages at a lower cost than previously possible.

The key is to design your infrastructures for the possibility of failure. Amazon’s CTO Werner Vogels has been preaching this religion for many years, suggesting the only way to test the true robustness of a system is to ‘pull the plug.’ Netflix — itself a major cloud infrastructure user — has created a process it calls “the Chaos Monkey” that randomly kills running server instances and services just to make sure the overall system continues to operate well without them. Not surprisingly, Netflix’s overall operation saw little impact from the AWS U.S. East outage when it occurred.
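
To make that concrete, here is a minimal chaos-monkey-style sketch in Python using the boto EC2 library. It is an illustration of the idea, not Netflix's actual tool; the "role:web" tag, the region, and the dry-run flag are assumptions for this example.

```python
import random
import boto.ec2

# Chaos-monkey-style test: pick one running instance from a target group
# at random and terminate it, then rely on monitoring to confirm the
# overall service keeps operating without it.
DRY_RUN = True  # flip to False only in a system designed to survive this

conn = boto.ec2.connect_to_region("us-east-1")
reservations = conn.get_all_instances(filters={"tag:role": "web"})
running = [i for r in reservations for i in r.instances
           if i.state == "running"]

if running:
    victim = random.choice(running)
    print("Chaos monkey selected %s" % victim.id)
    if not DRY_RUN:
        conn.terminate_instances(instance_ids=[victim.id])
```

Run on a schedule, a script like this turns "everything fails" from a slogan into a routine, observable test.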

Implementing failure-resilient systems isn’t easy. How can you quickly move your operations from one infrastructure to the next when the pressure is on and the alarm bells are ringing? How do you design a system that not only allows new compute resources to begin to operate as part of your service, but also folds in an up-to-date copy of the data your users and customers depend on?

Redundancy and Automation in the Cloud

There is, of course, no magic bullet. But there is a general approach that does work: combining redundancy in design with automation in the cloud management layer. The first step is architecting a solution from components that can withstand failures of individual nodes, whether those are servers, storage volumes, or entire data centers. Each component (e.g., the web layer, application layer, and data layer) needs to be considered independently and designed with the realities of data center infrastructure, Internet bandwidth, cost, and performance in mind. Solutions for resilient design are almost as varied as the software components they employ. Databases alone, for example, span a wide range of approaches and resiliency characteristics: SQL and NoSQL engines, replication, caching technologies, and so on.
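
As one concrete illustration of redundancy at the data layer, the sketch below routes reads to a primary database and falls back to a replica when the primary is unreachable. It uses Python with the psycopg2 PostgreSQL driver; the hostnames, database name, and timeout are placeholder assumptions.

```python
import psycopg2

# Read path with failover: try the primary first, then the replica.
# In production these endpoints would come from service discovery or
# configuration management rather than a hard-coded list.
READ_ENDPOINTS = ["db-primary.example.com", "db-replica.example.com"]

def read_query(sql, params=()):
    last_error = None
    for host in READ_ENDPOINTS:
        try:
            conn = psycopg2.connect(host=host, dbname="app", user="app",
                                    connect_timeout=2)
            try:
                cur = conn.cursor()
                cur.execute(sql, params)
                return cur.fetchall()
            finally:
                conn.close()
        except psycopg2.OperationalError as exc:
            last_error = exc  # endpoint unreachable; try the next one
    raise last_error
```

The same try-the-next-endpoint pattern applies at the web and application layers; what changes from layer to layer is how stale or degraded a fallback is allowed to be.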

But the secret sauce really comes in how your architecture is operated. Which parts of the system can respond automatically to failure, which can respond nearly automatically, and which not at all? To be more specific, if a given cloud resource goes down, be it a disk drive, a server, a network switch, a SAN, or an entire geographical region, how seamlessly can you launch or fail over to another and keep operations running? Ideally, of course, the more that's automated (or nearly so), the better your operational excellence.
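
Here is what "nearly automatic" can look like in practice: a watchdog that polls a health endpoint and invokes a replacement hook after repeated failures. The URL, the threshold, and the relaunch_server() hook are assumptions for illustration, standing in for whatever your management layer provides.

```python
import time
import urllib2  # Python 2 standard library

HEALTH_URL = "http://app.example.com/health"  # placeholder endpoint
FAILURES_BEFORE_ACTION = 3  # tolerate brief blips before acting

def healthy():
    try:
        return urllib2.urlopen(HEALTH_URL, timeout=5).getcode() == 200
    except Exception:
        return False

def relaunch_server():
    # Placeholder hook: here you would call your cloud or cloud
    # management API to launch a replacement node.
    print("launching replacement server...")

failures = 0
while True:
    failures = 0 if healthy() else failures + 1
    if failures >= FAILURES_BEFORE_ACTION:
        relaunch_server()
        failures = 0
    time.sleep(30)
```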

Achieving that level of automation requires that your system design and configuration be easily replicable. Servers, for example, need to be quickly re-deployable in a predictable fashion across different cloud infrastructures. It's this automation that gives organizations the life-saving flexibility they need when crisis strikes. Our own RightScale ServerTemplate methodology, as an example, provides this re-deployment capability, allowing a server brought down by an outage to be launched in another cloud in a matter of minutes.
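
RightScale's ServerTemplates themselves aren't shown here, but the underlying idea, one portable server definition that can be launched on whichever cloud is available, can be sketched with the open-source Apache Libcloud library. The credentials and image/size IDs below are placeholders.

```python
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

def launch_web_server(provider, key, secret, image_id, size_id):
    # One logical server definition, launchable on any supported cloud.
    driver = get_driver(provider)(key, secret)
    image = [img for img in driver.list_images() if img.id == image_id][0]
    size = [s for s in driver.list_sizes() if s.id == size_id][0]
    return driver.create_node(name="web-1", image=image, size=size)

# Normal operation on EC2; during an EC2 outage the same call targets
# another provider (the IDs here are placeholders, not real images):
# launch_web_server(Provider.EC2, KEY, SECRET, "ami-12345678", "m1.small")
# launch_web_server(Provider.RACKSPACE, USER, APIKEY, "112", "2")
```

The launch call is the easy part; the value of a template-driven approach is that the server's full configuration (packages, credentials, monitoring) arrives with it, so the replacement is actually equivalent to what it replaces.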

Customizable Best Practices in the Cloud

The right cloud management solution should simplify the process of launching entire deployments through customizable best practices. It should also provide complete visibility into all infrastructures through a central management dashboard – a ‘single pane of glass’ – through which administrators can monitor performance and make capacity changes based on real-time needs. The same automation and control that gives organizations the ability to scale up or down using multiple servers when demand increases also allows them to migrate entire server deployments to a new infrastructure when disaster strikes.

The fallout from the Japanese earthquake and Amazon outage is being felt throughout the business community and is causing organizations to rethink how they ensure business continuity. Cloud architecture provides the distributed structures necessary to counteract regional disasters, but companies also need the cloud management capabilities necessary to fail over their operations to multiple infrastructures in a way that keeps things up and running.

Some may have thought the cloud was a magic bullet. It’s not, and that’s actually good news. By recognizing one of the original founding principles of cloud architectures — that everything fails at some point — businesses are now in a position to design and build services that are more resilient than in the past, at a fraction of the cost. With the right architecture and management layer, cloud-based services can provide unparalleled disaster protection and business continuity.

Michael Crandell is CEO of RightScale, the leader in cloud computing management.

Comments

  1. Poul Erik Holm Monday, May 30, 2011

    Absolutely right, Michael. The problem with cloud and resilience lies more in the application layer than the infrastructure; not that many applications are suitable for the cloud. Yet many companies seem to think that their normal business applications can be put into the cloud and will then be resilient and cheap, when "business applications" are rarely suitable for moving around in a cloud environment.

  2. To me the cloud environment is not problematic at all. The reasons are multifold: 1) If you look at IT companies in Bangalore alone, they number in the millions, and the amount of travel to the company keeps growing, which causes environmental hazards! 2) Since we IT people all come from a programming background, I feel it should work like "if some cloud is down, move to the next cloud," something like clustering, but no company is offering a multicloud where the person who buys one cloud gets another cloud free for hosting content. 3) In many ways cloud infrastructure helps; I think of how factories were kept outside the cities in the olden days so they would not cause the environmental harm ours do today. I feel if we think that way, we can make data centers like the factories of the olden days!

    Think Good Talk Good Be Good

  3. Very interesting and informative, thank you.

    P.S. What do you think of one of the new solutions we have in our portfolio? MS VMs are themselves used in their own DC! Everything does fail eventually, I agree, but if you can't see it you can't fix it. Downtime being a constant threat, finding and fixing those 'issues' is of course key across any network, be it virtual or physical. See Xangati in our portfolio; obviously you'd want to see it running in a working environment before giving a true review, but I'd respect your opinion.

    Best regards

    Matt

  4. Rick Parker Monday, May 30, 2011

    The flaws were due to poor design or process and don't have anything to do with a cloud infrastructure. A good cloud design with full redundancy will always be more reliable than traditional server-room IT, if nothing else due to backup power generators and redundant power circuits.

  5. Mans bhuller Monday, May 30, 2011

    Hear, hear!
    Whether a service is cloud-based or physically at your fingertips, designing redundancy across all layers requires both infrastructure and application knowledge. The cloud cannot fix bad application availability design (e.g., no transaction rollback during failure), and vice versa. Let's wait to see if cloud service providers do a better job with their new services on top of their own cloud infrastructure.

  6. Although cloud computing has been with us for years now, its full capabilities are still considered new. Because of this, some of its features should still be tested and observed. Companies, especially those in the small and medium-sized enterprise group, should NOT FULLY rely on cloud computing to run their businesses, because it is still in its infancy. You can't give your full trust to a child, right?

    Let's wait for cloud computing to fully mature. Who knows, five to 10 years from now it could be the biggest IT innovation since the Internet.

  7. I need to understand more about cloud computing in the areas of infrastructure and security, and how effective it is in terms of cost and manageability.

