
Summary:

Storing data in a server is a tried and tested process. In the cloud, however, storage is still a work in progress. And the cloud model puts increased pressure on networking and server equipment — and on vendors to make their components reliable.

Load balancers are a cornerstone of any big computing application. By spraying traffic across lots of servers, they let companies turn many unreliable machines into one reliable service. But that service has a lot of moving parts, and sometimes they break. If it keeps happening, it may signal that a new class of networking device is needed for the demands of cloud computing.
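
In code, the core mechanism is small: rotate through a pool of backends and route around any machine that fails a health check. Here is a minimal sketch in Python; the pool, the addresses and the health check are invented for illustration:

```python
import itertools

# Hypothetical backend pool; real deployments would use health-checked
# server addresses behind a virtual IP.
BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

class RoundRobinBalancer:
    """Turn many unreliable machines into one reliable service by
    rotating requests across the pool and skipping dead members."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)
        self._size = len(backends)

    def next_backend(self, is_healthy):
        # Try each backend at most once per call; a machine that fails
        # its health check is simply routed around.
        for _ in range(self._size):
            candidate = next(self._cycle)
            if is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy backends available")

balancer = RoundRobinBalancer(BACKENDS)
# Pretend the second machine has died: traffic flows around it.
pick = balancer.next_backend(lambda b: b != "10.0.0.2:8080")
print(pick)  # -> 10.0.0.1:8080
```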

According to Amazon’s Web Services Developer Connection, a load balancer was deployed in its S3 storage service on June 20 and removed two days later. During that time, under load, it was corrupting bytes of data sent to S3.

This isn’t the first time load balancers have been implicated in an outage at Amazon. At O’Reilly’s Velocity conference, co-chair Jesse Robbins talked about a “redundant array of inexpensive data centers” as the basis for tomorrow’s computing platforms. Load balancing is what makes this possible.

Storing data in a server is a tried and tested process. We’ve had decades to optimize the way we store and retrieve data, with standards like iSCSI and IDE proven out worldwide. And RAID is pretty reliable these days.

But the cloud’s still figuring out storage. There are competing models: S3, SimpleDB, BigTable and so on. And with more and more applications relying on S3 for their data, outages are bound to be visible and public.

The cloud model puts increased pressure on networking and server equipment, and on vendors to make their components reliable. Load balancers built for enterprise data centers may not be suited for the cloud, just as domestic power generators wouldn’t work for utility companies. This may be one reason Google is reputed to be building its own switches.

Expect clouds to require significantly different kinds of networking equipment. It’s either an opportunity for networking vendors like Cisco and Juniper, or a huge threat to them.

8 Comments

  1. A big part of the problem is a lack of experienced individuals when it comes to operating large-scale clusters. To use the Amazon S3 example above, the best practice for extremely large hardware clusters is to use end-to-end software checksums to catch the occasional hardware failure that lets corruption slip past hardware CRC and checksum systems (a minimal sketch follows this comment). It looks like they learned that lesson the hard way, but they should not feel too bad, because Google did too. If you put enough silicon in a room, you can no longer count on its internal error-correction mechanisms; software checks need to be instituted. That was originally discovered by the supercomputing community, and it still has not penetrated the broader developer space.

    I will generally agree, though, that a lot of network gear is poorly designed for large-scale distributed systems: it either uses inexpensive silicon that is architecturally under-powered for clusters and max-load usage (for the cost-conscious markets), or it is “carrier class” networking gear with performant silicon but a ton of other features glued on that are useless for distributed cluster applications and that drive the price way up. It is just a matter of time before one of the networking gear companies starts producing switch engines specifically designed for large-scale cluster applications, if they haven’t already.
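
A minimal sketch of the end-to-end software check the commenter describes, assuming nothing about Amazon’s or Google’s actual implementations (SHA-256 and the framing are illustrative choices):

```python
import hashlib

DIGEST_LEN = 32  # bytes in a SHA-256 digest

def wrap(payload: bytes) -> bytes:
    # Sender side: prepend an application-level digest, so the check
    # spans the whole path rather than any single hop or device.
    return hashlib.sha256(payload).digest() + payload

def unwrap(message: bytes) -> bytes:
    # Receiver side: recompute and compare before accepting the data.
    digest, payload = message[:DIGEST_LEN], message[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise IOError("end-to-end checksum mismatch; payload corrupted in flight")
    return payload

# Per-hop CRCs are recomputed at every device, so a bit flipped inside
# a middlebox sails through them; the end-to-end digest still catches it.
message = bytearray(wrap(b"object data headed for the storage service"))
message[DIGEST_LEN + 5] ^= 0x01  # simulate corruption inside a middlebox
try:
    unwrap(bytes(message))
except IOError as err:
    print(err)  # corruption detected before the data is stored
```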

  2. Fazal Majid Friday, June 27, 2008

    I think Amazon handled this one quite well, in fact.

    The error was introduced in the load balancer, in the part where it copies data from one incoming SSL connection to an outgoing (presumably non-SSL) connection. This could be due to any number of reasons, including defective memory or firmware bugs. Since the corruption was introduced between connections, the TCP checksum mechanisms on either connection wouldn’t have caught it, which is why there is still a need for end-to-end checksums. Amazon’s API does actually have such a mechanism; it was just optional and inconsistently enforced. I am sure they will fix that in the next release.

    The conclusion I draw from this incident is the opposite of yours: hardware is much more reliable than software, but you can’t trust it entirely either. Simply replacing the load balancer with another brand will not eliminate the vulnerability, so it is not a solution. Cost, manageability, or defect rates would be better reasons for Amazon to switch load-balancer suppliers.

    HP ProCurve has an interesting pitch: all their Ethernet switches now have a programmable CPU core per port, and they offer SDKs so partners can implement custom advanced functionality on them. A ProCurve switch costs an order of magnitude less than an F5 or NetScaler, and the per-port ASICs should be perfectly up to the task of load balancing and simple firewalling (if not SSL acceleration, which is much more computationally intensive). A company like Amazon or Google could save a lot of money by entering the ProCurve partner program and writing custom cloud-oriented logic that does exactly what they need and not one bit more, reducing cost, complexity and the likelihood of bugs, as the previous poster noted.

  3. David Ulevitch Friday, June 27, 2008

    1) The network is designed to scale; a broken load balancer doesn’t indicate anything.
    2) The customers in the clouds are buying Arastra switches, the exact same hardware base Google is using (Google just runs its own software on it). All these chips are made by Fulcrum Microsystems. They are doing *very* well.
    3) This is nothing new, and it is not a threat to Cisco or Juniper. Cisco will eventually just buy Arastra; this happens every couple of years.

  4. Craig Balding Thursday, July 3, 2008

    Alistair

    My take on this is similar to Fazal’s. Amazon supports API-level integrity checks in the form of MD5 digests to detect corruption at transfer time (a sketch of the Content-MD5 mechanism follows this comment). It was the people using MD5 who picked up this issue; for everyone else, data was being silently corrupted (!).

    As S3 is primarily targeted at developers, this incident demonstrates the need for greater awareness of cloud storage integrity API options and their limitations.

    I’ve posted on the issue here:
    http://cloudsecurity.org/2008/06/25/a-question-of-integrity-to-md5-or-not-to-md5/

    Thanks,
    Craig
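
For reference, S3’s Content-MD5 header is exactly this kind of check: the service recomputes the digest on arrival and rejects a mismatched upload with a BadDigest error. A sketch using the third-party requests library; the URL and helper function are hypothetical, and request signing is omitted:

```python
import base64
import hashlib

import requests  # third-party HTTP client; pip install requests

def put_with_md5(presigned_url: str, data: bytes) -> None:
    # S3 recomputes the MD5 of the body it receives and rejects the
    # PUT with a BadDigest error if it differs from this header, so a
    # byte corrupted in flight never lands in the bucket.
    md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
    resp = requests.put(presigned_url, data=data,
                        headers={"Content-MD5": md5_b64})
    resp.raise_for_status()  # corruption surfaces as an HTTP 400 here

# Hypothetical presigned URL; real ones carry signature query parameters.
# put_with_md5("https://example-bucket.s3.amazonaws.com/my-key?...", b"payload")
```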

  5. S3 Outage Highlights Fragility of Web Services – GigaOM Sunday, July 20, 2008

    [...] shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous [...]

  6. Scale Fail : Beyond Search Monday, July 21, 2008

    [...] shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous [...]

  7. Will Microsoft Tempt Enterprises Up To the Cloud? – GigaOM Tuesday, October 28, 2008

    [...] very sensitive data, and all of them said their clients would balk at cloud storage until they get a closer look at the security and reliability of the architecture. In the U.S. there are, at the very least, regulatory hurdles around storing sensitive data [...]

  8. SpringSource Buys Startup to Scale Messaging in the Cloud Tuesday, August 10, 2010

    [...] backed by major banks, Cisco and a handful of smaller companies. As hardware is virtualized, translating some of the network equipment like load balancers into software allow services running on the virtualized hardware to scale better. Hopefully we’ll learn [...]
