
Summary:

Dealing with the awesome amounts of data generated by users and serving up relationships tied to that data quickly are forcing web-scale sites like Twitter, Reddit and Facebook to investigate a variety of home-built, open sourced solutions. Here’s what they are using and why it matters.

Facebook's future home for big data

Dealing with the terabytes of data generated by users online and serving up relationships tied to that data quickly are forcing web-scale sites like Twitter, Reddit and Facebook to investigate a variety of home-built and open-source software and hardware solutions, and to reject as many closed-source software products (such as Oracle) and specialized hardware solutions as possible.

It may seem like a foregone conclusion, but the ideological and practical bias that web-scale companies hold against closed-source software products and specialized physical hardware has big implications, not just for vendors like Microsoft and Oracle if they get locked out of businesses built on the web, but for all businesses. That’s because broadband networks, cloud computing and the shift toward more rapid adoption and integration of web technology into our everyday lives change the business models and opportunities for all businesses.

The new business models will take into account the need to attract users individually, on a personal level, while also connecting them to the other products they use. These services will be designed to be accessed on phones, monitors and any other screen. They will be hyper-personal, even when they are dictated by corporate IT departments. Because of the “use anywhere” nature of these services, and the myriad connections out to other applications, they will have to manage a lot of user data, handle a lot of requests from outside a network, and scale out to meet demand.

Given this framework, the panel on scaling open source frameworks past MySQL was one of the more interesting ones at the South by Southwest Interactive conference this weekend. Scalable databases are part of the future of IT for many businesses. You can’t build the types of services discussed above without scalable databases. And those databases, and generally all of the tools used to achieve cheap and agile scale, are open source.

Citing a desire to support open source code, as well as the need to peek under the hood and be able to solve problems quickly, a panel of four guys responsible for building various architectures at Twitter, Facebook, Reddit and Imgur said specifically that they avoid Oracle in favor of rolling their own databases. Most even derided proprietary hardware and specialized networking gear, with the exception of Facebook’s Serkan Piantino, who said the company does use proprietary F5 gear behind software load balancers. Piantino also said that Facebook was testing super fast solid-state drives from a company called Fusion-io as a means to speed up access to data.

But for the most part, building your own code and working with open source code ruled the day. Even if there wasn’t an open source solution that was readily available or mature, the consensus was that folks would wait until something was ready, or if the pain was too much, build it themselves. For example, an audience member questioned the panel about any good columnar database stores beyond Hadoop, and Kevin Weil from Twitter explained that there were some closed source options out there, but the open source world’s products were still “a little early.” So Twitter does without for now.

Other tidbits of interest on scaling databases from the panel were:

  • Nginx got a big shout out as an alternative to full Apache as a web server.
  • HAProxy is also a popular way to either load balance or merely break requests up so that a cache or a database can serve them faster.
  • Both Twitter and Facebook are using P2P technology (Twitter calls it Murder) to provision services because instead of taking five to seven minutes to bring one online, it takes 37 seconds.
  • Facebook plans to open source Haystack, its visual storage system, within a few months.
  • While Hadoop isn’t used much, if at all, on the front end at either Twitter or Facebook, engineers use it on the back end to deliver granular analytics about how people use the site that otherwise wouldn’t have been possible.
  • If you can’t speed up the process with better databases, caching or anything on the software and hardware side, try user interface tricks to make it seem faster, such as saying the video is done uploading even if it isn’t yet.
  • Facebook no longer thinks in terms of servers deployed; it now thinks in terms of deploying entire racks. The software the company is running is rack aware so it can take advantage of all of the bandwidth on a given switch in the rack. It looks like an intermediate step toward running your data center as a computer.
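
To see why peer-to-peer provisioning can beat a central push by such a wide margin, here is a rough back-of-the-envelope sketch in Python. It is not how Murder or Facebook’s system actually works; it assumes a simplified model in which a payload moves in equal-length rounds, a central server can upload to a fixed number of hosts per round, and in the P2P case every host that already has the payload passes it to one new host per round. The function names and the figure of 10 parallel uploads are hypothetical.

```python
import math


def rounds_central(num_hosts: int, parallel_uploads: int) -> int:
    """Rounds needed when a single origin uploads to a fixed number of hosts per round."""
    return math.ceil(num_hosts / parallel_uploads)


def rounds_peer_to_peer(num_hosts: int) -> int:
    """Rounds needed when every holder of the payload seeds one new host per round.

    Simplified assumption: the number of holders doubles each round,
    starting from the origin alone.
    """
    holders = 1  # the origin server
    rounds = 0
    while holders < num_hosts + 1:  # origin plus every target host
        holders *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    hosts = 1000
    print("central push rounds:", rounds_central(hosts, parallel_uploads=10))
    print("peer-to-peer rounds:", rounds_peer_to_peer(hosts))
```

Under these assumptions the central push grows linearly with the number of hosts, while the peer-to-peer spread grows only logarithmically, which is the general shape of the speedup the panelists described.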

For the GigaOM network’s complete SXSW coverage, check out this round-up.

  1. “Web Scale” is quickly changing these days: more and more businesses are becoming data-centric and need to be able to quickly handle hundreds of GB to PB of data in more ways than a traditional RDBMS allows. Hadoop is just one facet of the problem.

    Not to mention if you use a traditional database in the “cloud”, you have twice the headache: all the woes of the RDBMS, plus the new pains associated with the cloud. Right now, we’re not very close to “data center as a computer”.

    It’s why Drawn to Scale (http://www.drawntoscalehq.com) built their new platform. It’s scalable, easy, and real-time, so companies can build their business and not a data infrastructure.

  2. Facebook open-sourced Cassandra almost 2 years ago and the project is now completely managed by the community, recently graduating to a top level Apache project. It is in production at Facebook, Digg, and Reddit (and some other places you mentioned are using it, despite the public quotes). Its absence from your article is strange.

    1. Benjamin, they didn’t spend much time at all talking about Cassandra, which is why it didn’t figure highly. I also tried to point out the stuff that felt newer. As you said, Cassandra was open sourced almost two years ago.

      1. Hard to reconcile including quotes like “it’s still too early” with “pointing out the newer stuff”, especially when you include references to nginx and HAProxy, which are positively antique by that standard.

  3. There is MongoDB too. I believe it will be the leader at this data scale.

    1. That was also mentioned. For more details, check out this Twitter stream of the #beyondLAMP hashtag: http://twitter.com/search?q=%23beyondlamp

  4. [...] Needs to Read to Learn What’s Going on in Washington, D.C. See All Articles » When it Comes to Web Scale Go Cheap, Go Custom or [...]

  5. I’m not defending Oracle here, but one of the biggest mistakes many people make is forgetting that there is no single tool for all jobs. Period. Thinking like “if Google does this, we had better start looking into it” stinks. It might fit or it might not.

    Oracle shines in some areas and has its problems in others. What you refer to as specialized hardware is something you might want or not, again depending on your business: the majority of Oracle hardware is SPARC and x86, just as IBM or HP would offer. You also mention closed software, but Oracle has many software offerings, including what it received from Sun, and the majority of these are not closed but released as open-source projects: Java, OpenSolaris, GlassFish, etc. So again, it depends…

    I fully agree Nginx is a very capable server, but for static content; it has its limitations handling dynamic content via FastCGI. I’m using nginx and I really like it for its capabilities, but mileage might vary. I evaluated a number of HTTP servers before selecting it; I did not copy-paste what I read on Google ;) This is one of the major mistakes many companies make.

    To summarize all of these:

    • There is a high percentage of copy-paste between Facebook, Twitter, Google and others. They don’t even look at what’s good for them; they simply copy-paste.

    • The volume of data drives them insane. Can you delete your Facebook profile? Do they recycle old data which nobody uses? Do they have any categorization of what counts as old, obsolete data?

    • Poor testing: copy-paste adoption leads to poor testing and non-existent proof-of-concept phases, where people swallow things whole and digest them the hard way. This drives them insane later. Proper planning and performance analysis is something which Facebook, Twitter and many others have no idea about.

    1. Got any evidence that Facebook and Twitter do not do planning or performance analysis? How do, e.g., http://engineering.twitter.com/2010/02/anatomy-of-whale.html or http://www.facebook.com/note.php?note_id=62667953919 square with your claim?

      Also confused about your copy-paste assertion. Each of those companies has developed — and released as open source — new components that don’t resemble anything released by the other two. In what way is that “simple copy-paste” with no regard to what’s suitable for their own needs? Would also love some examples here.

      In short, I think you’re making a lot of accusations here. On what do you base them?

      1. This has been discussed several times in many other places. Check for instance:
        http://idleprocess.wordpress.com/2009/11/24/presentation-summary-high-performance-at-massive-scale-lessons-learned-at-facebook/

        Take ideas like this one from Mark Zuckerberg: “Work fast and don’t be afraid to break things.” Overall, the idea is to avoid working cautiously the entire year, delivering rock-solid code but not much of it. A corollary: if you take the entire site down, it’s not the end of your career. Semantics aside, these words do not suggest that performance analysis and testing play a leading role there.

        Again, I’m not interested in bashing anyone or wasting my time, but rather in signaling problems and being constructive; doing things properly means learning and looking outside the box we always like to live in :)

        The copy-paste idea here is that certain software pieces might or might not fit the business we want to build. But without testing, module validation and performance analysis, that won’t fly. These folks usually don’t have time to evaluate anything, nor do they consider doing it, for certain other reasons: investors, time to deliver something, etc… They usually end up “copy-pasting” things between each other (mileage might vary), but that’s the trend.

        Example: this is how for instance MySQL got the attention vs. PostgreSQL, which has more mature and robust features than MySQL. Go figure.

        On a positive note, I hope people will look at and properly evaluate their needs before implementing something. That’s all I’m trying to say. All the best

    2. It’s definitely not cut’n’paste.

      Some of the commonality in software stack across different companies takes place because everyone tries to leverage existing open source projects (faster/cheaper). In this case the system chosen may not be perfect/optimal – but it’s a lot better than starting from scratch.

      Many of these choices are made early in a company’s life and are hard to change later on. A fast growing 1M user company will cut many corners and incur small inefficiencies to move fast. Those inefficiencies can become glaring and expensive once it reaches 400M users. But that doesn’t mean the original choices were wrong – simply that they need to be iterated upon.

      One of the benefits of running a web site/service is that one can change the internals transparently. Being driven insane because of growth is a good problem to have (assuming that there’s an accompanying business model). One can always hire smart engineers and rejigger things (which is what Digg/FB etc. are doing).

      All this said – it is pretty clear that Google has had an enormous impact on web architecture. Very few companies have the level of talent at the top that they do. As I was reading earlier today – good artists copy, great artists steal. You can call it cut’n’paste – but learning from the masters is not a bad thing at all.

  6. [...] SXSW: When it Comes to Web Scale Go Cheap, Go Custom or Go Home [...]

  7. Useful post and even better discussion in the comments section. I am no expert and hence loved reading this. I agree quite a lot with Stefan though.

    1. Not trying to bash anyone but… we need to go back to simplicity, and automate and properly test what we create. We should work less, not more! Machines should work for us, not we for them ;) A company which hires dozens of SysAdmins is not doing it right, even if it is at 400USD per share … something stinks there.

      My point: “The Right Tool For The Job” is becoming a thing of the past for all these web wizards. Put your seat belts on and wait 3-5 more years to see what these folks will create.

      Common sense, and the things we have nowadays that are still somewhat manageable, will be gone, extinct. Bloated dynamic sites with data nobody is interested in and massive deposits of family pictures will dominate future analysts’ press conferences.

      Sorry, I always look from a perspective of a sysadmin :(

  8. [...] per second per commodity machine using Gizzard. I heard Twitter’s, Kevin Weil talk about the project a few weeks ago at SXSW, and at the time he said the company was building something to help manage distributed data sets [...]

  9. [...] less energy — albeit at a greater cost. HP sells servers with Fusion-io drives inside, and at SXSW this year Serkan Piantino of Facebook noted that the social network is testing Fusion-io drives. Other customers include MySpace. However, [...]

  10. [...] may continue to influence folks, even as startups like Twitter, Northscale, Facebook and others are seeking ways to stay on top of the real-time flow of information and offering their own efforts to the open source community. [...]

