
Summary:

My company, SlideShare, has been using cloud computing for almost everything we do. But if comic books have taught us anything at all, it’s that with great power comes great responsibility, and we’ve made our share of blunders. Here are a couple of the more notable ones.

Cloud computing is a big deal for startups. Having essentially unlimited computing capacity available at the touch of a button opens up amazing new possibilities. The power to launch 1,000 servers on demand (and tear them down just as quickly) is indeed remarkable. But if comic books have taught us anything at all, it’s that with great power comes great responsibility!

My company, SlideShare, has been using cloud computing for almost everything we do, so we’ve made our share of blunders. Below are two of the more notable ones:

How to Lose $5000 Without Even Trying

Several months ago, we became fascinated with Hadoop. We organized a Hadoop Hackday at our office, and very quickly wrote some prototype code for calculating analytics data for SlideShare users.

Hadoop analytics is a perfect task for cloud computing. You need a bunch of computers, but you only need them once a day to crunch all the data. But as we started testing our prototype code with larger and more realistic data sets, it started taking longer and longer to complete a job.

At that point, I made the call to nearly quadruple the number of machines (from 20 to 75). This decision actually made sense: if it’s going to take 100 computer hours to get a task done, then you might as well have 100 computers work for one hour and get the job done faster.
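The arithmetic behind that call can be sketched in a few lines. The hourly rate below is a made-up placeholder (not an actual AWS price), and the function name and the 72-hour runaway figure are hypothetical too. The point is only that the bill tracks machine-hours, so spreading the work across more machines costs the same, provided the job actually finishes.

```python
def job_cost(machines, hours_each, hourly_rate):
    """Cost of a job that is billed per machine-hour."""
    return machines * hours_each * hourly_rate

RATE = 0.50  # hypothetical dollars per machine-hour, for illustration only

# 100 machine-hours of work, split two different ways.
one_machine = job_cost(1, 100, RATE)
hundred_machines = job_cost(100, 1, RATE)
print(one_machine, hundred_machines)  # the two bills are identical

# A job that never finishes keeps billing for as long as it runs.
# 72 hours is an assumed duration, not a figure from the incident.
runaway = job_cost(75, 72, RATE)
print(runaway)
```

The faster option is only free if something stops the machines when the work is done, or when it clearly is not going to get done.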

A few hours after I made that decision, a major site outage occurred that distracted everyone on the engineering team. We worked through the night and the next day, and recovered from the (unrelated) crisis by Friday afternoon. We all took a well-deserved weekend off, and came in Monday morning to discover that the analytics job we’d started before the crisis was still running. Our buggy code was failing in a way we hadn’t anticipated, so throwing hardware at the problem hadn’t helped. Meanwhile, we’d run up a bill of $5000 with Amazon Web Services!

Lesson learned: if you’re going to really use the power of cloud computing, you need to constantly monitor spend and make sure it doesn’t get out of whack and break the budget, especially if you’re going to be scaling up and down dramatically. Unfortunately, Amazon Web Services doesn’t provide any alerting or charting tools that make it easy to keep track of spend; tracking it is a cumbersome process of downloading CSV files, importing them into Excel, and analyzing the data. But it has to be done.
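Once a billing file has been downloaded, the budget check itself is easy to script. What follows is a minimal sketch, not a description of our tooling: the column names (`Service`, `Cost`), the sample figures, and the function names are all assumptions, and a real export will be laid out differently.

```python
import csv
import io

def total_spend(csv_text, cost_column="Cost"):
    """Sum the cost column of a billing export supplied as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[cost_column]) for row in reader)

def over_budget(csv_text, budget, cost_column="Cost"):
    """Return True when the total spend in the export exceeds the budget."""
    return total_spend(csv_text, cost_column) > budget

# Hypothetical export; real column names and amounts will differ.
SAMPLE = """Service,Cost
EC2,4200.00
S3,650.50
"""

if over_budget(SAMPLE, budget=1000):
    print("ALERT: spend is over budget:", total_spend(SAMPLE))
```

Run daily against the latest download, a check like this would at least turn a surprise on Monday morning into an alert the same day.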

Getting Sloppy With Storage

We recently noticed that our spend on storage (Amazon S3) was increasing dramatically. A few days of investigation revealed that there was a general lack of discipline in how we were using storage. Files that could be deleted were being left in place; files for different purposes were being kept in the same directory; and there were some files whose origin we couldn’t identify, so we couldn’t tell whether they were still needed!

Amazon S3, or any cloud storage for that matter, can be thought of as a giant file system. There’s no overarching control over what data goes where: it’s up to you to make sure you use the storage in a disciplined way. If only one person is writing the code, this is easy. However, once you have a team of people writing multiple programs, it’s easy to forget to delete something. You need to make sure you don’t waste storage, and the only way to do that is to be really specific about what data is saved where. A best practice is to put each type of resource in a separate “bucket” (Amazon’s name for a top-level directory), since that’s really the only way you can get accurate statistics about how much storage is being used for each type.
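To illustrate the kind of accounting a bucket-per-type layout makes possible, here is a small sketch that totals storage per bucket. The bucket names, keys, and sizes are hypothetical, and the listing is a hard-coded list of tuples; in practice it would come from whatever S3 tooling you use.

```python
from collections import defaultdict

def usage_by_bucket(objects):
    """Total bytes per bucket for (bucket, key, size_in_bytes) records."""
    totals = defaultdict(int)
    for bucket, _key, size in objects:
        totals[bucket] += size
    return dict(totals)

# Hypothetical listing: one bucket per type of resource.
listing = [
    ("example-thumbnails", "1.png", 2000),
    ("example-thumbnails", "2.png", 3000),
    ("example-uploads", "deck.ppt", 500000),
    ("example-analytics-tmp", "part-00000", 42),
]

for bucket, total in sorted(usage_by_bucket(listing).items()):
    print(bucket, total)
```

When every bucket holds exactly one type of resource, a sudden jump in one total points directly at the program responsible.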

The Spider-Man Principle

In both cases, we learned we weren’t being disciplined enough to handle the power cloud computing put at our fingertips. If we’d been on leased hardware, we would have hit hardware limits (running out of disk space). It would’ve been inconvenient, but it would have forced us to think about what we were doing, and make a conscious decision to spend more money. It’s great to have the super-power of cloud computing, but you need to be responsible if you want to use it!

Jonathan Boutelle is Co-Founder and CTO of SlideShare, a web site for presentations that relies heavily on cloud computing. Previously, Jonathan was a principal at Uzanto (a UI consulting firm) and worked as a software engineer at CommerceOne (a B2B enterprise software firm) and Advanced Visual Systems (a 3D graphics startup). You can find his presentations on cloud computing at slideshare.net/jboutelle, and his Twitter is @jboutelle. He also blogs at www.jonathanboutelle.com.

Image courtesy Flickr user adactio under creative commons.


  1. Pradeep Padala Thursday, August 26, 2010

    Very interesting. Have you built any in-house tools to handle these issues (like checking the storage quota, spent money etc.)? If you have used any open source tools, can you point me to those?

    1. Pradeep,

      We have tools in place to alert us and the customer if too many webbies have been added. We don’t have any open source tools for this at this point; it’s all part of our webby manager system.

      Out of curiosity, how many instances do you own with Amazon on a permanent basis?

      1. I see, I haven’t used Webby, but seems like a good tool. How do you use it to control your Amazon instances?

  2. Carlos Taborda Thursday, August 26, 2010

    Ouch, that’s why implementing a confirmation email once a customer goes over a certain limit is key.

  3. Jonathan Boutelle Thursday, August 26, 2010

    Amazon provides the data in xml form and csv form for download. We’re working on software to process the xml. Unfortunately, a human needs to download it: AWS doesn’t provide any APIs for doing this!

    1. Interesting read. I guess people easily get carried away with the promise of cheap storage (which is kind of true) and don’t realize that it adds up day by day. I would be very interested in knowing what you mean by “working on software to process the xml”. Are you writing some kind of analytical software that will tell you about your storage usage?

  4. Ouch!

    The approach we took with Blabbelon was to put the hosting account on a credit card with a reduced ($2k/month) limit. That enabled us to go nuts during testing and let the whole thing fail (it self-scales, so the risk was that a bug could make the system go and buy itself a load of servers we didn’t really need).

    So another useful fail-safe is to ensure that your credit card will start bouncing before the “system” has the opportunity to go completely ballistic….

  5. Jaroslav Gergic Thursday, August 26, 2010

    Could you be more specific about the task you were trying to solve with Hadoop? I mean the analytical use case, data model and data volume. It would be interesting to give it a spin on our platform. The machine numbers you mention in the post are frightening.

  6. Jonathan Boutelle Thursday, August 26, 2010

    Yeah we have a “free-fire” zone for developers … a completely separate account. People can do anything they want, but everything gets wiped out at the end of the workday.

    But for production that approach obviously doesn’t help much…
