When it comes to cloud computing, no one has entirely sorted out what’s hype and what isn’t, nor exactly how the enterprise will use it. What is becoming increasingly clear, however, is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

Data is everywhere, whether it comes from users, applications or machines, and as we are propelled into the “Exabyte Era,” it is growing exponentially; no vertical or industry is being spared. The result is that IT organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it, as cheaply as possible. And so the race to cloud computing has begun.

This isn’t the first time IT architectures have been reinvented in order to remain competitive. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The subsequent shift to web applications and web services during the last decade was enabled by the open networking of applications and services through the Internet buildout. While cloud computing will leverage these prior waves of technology (computing and networking), it will also embrace deep innovations in storage and data management to tackle Big Data.

A Big Data stack
But as with prior data center platform shifts, a new “stack” (like mainframe and OSI) will also need to emerge before cloud computing is broadly embraced by the enterprise. Basic platform capabilities, such as security, access control, application management, virtualization, systems management, provisioning and availability, will have to be standard before IT organizations are able to adopt the cloud completely. In particular, this new cloud framework needs the ability to process orders of magnitude more data, in ever closer to real time, and to do it at a fraction of the typical cost, by leveraging commodity servers for storage and computing. Maybe cloud computing is all about creating a new “Big Data stack.”

In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. In response, high-performance, scalable, distributed, object-oriented data stores are being developed internally and implemented at scale. At first, many solved this problem by sharding vast MySQL deployments, in essence using them more as key-value stores than as true relational databases (no complex table joins, etc.). As Internet data centers continued to scale, however, even sharded MySQL couldn’t keep up.
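The sharding pattern described above can be sketched in a few lines of Python. This is a toy illustration only: the dictionaries stand in for MySQL instances, and all names are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Deterministically map a key onto one of num_shards database instances."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy sharded key-value store; each 'shard' stands in for a MySQL instance
    that is being used as a simple data store (no joins, no cross-shard queries)."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

The catch, as the paragraph notes, is resharding: because the shard is chosen by `key mod N`, growing the cluster changes where most keys live, which is one reason this approach stopped scaling.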

The rise of DNRDBMS
In response, large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS). While it can seem like a new one sprouts up every day, they can largely be categorized into two flavors: one, distributed key-value stores, such as Dynamo (Amazon) and Voldemort (LinkedIn); and two, distributed column stores, such as Bigtable (Google), Cassandra (Facebook), HBase (Yahoo/Hadoop) and Hypertable (Zvents).
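The second flavor is worth a concrete sketch. Bigtable-style column stores index each cell by row key, column (grouped into families), and timestamp; a minimal, single-machine illustration of that data model (not of any real system's API) might look like:

```python
class ColumnStore:
    """Toy sketch of the Bigtable-style data model:
    (row key, "family:qualifier", timestamp) -> value.
    Real systems distribute rows across many servers; this is one dict."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, timestamp, value):
        # Each cell keeps multiple timestamped versions of its value.
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get_latest(self, row, column):
        cells = self.rows.get(row, {}).get(column, {})
        if not cells:
            return None
        return cells[max(cells)]  # newest timestamp wins
```

Keeping versions per cell is what lets these stores serve reads without locking writers, one of the properties that makes them attractive at web scale.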

These projects are in various stages of deployment and adoption (it is early days, to be sure), but they promise to deliver a “cloud-scale” data layer on which applications can be built quickly and elastically, while retaining aspects of the reliability and availability of traditional databases. One facet common across this myriad of NoSQL databases is a data caching layer: a high-performance, distributed memory caching system that accelerates web applications by avoiding continual database hits. Memcached (disclosure: Accel is an investor in Northscale, which provides commercial support for memcached), which sits behind pretty much every Web 2.0 application, has become the de facto implementation of this layer and is now accepted as a “standard” tier in data centers.
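The way that caching tier avoids database hits is usually the cache-aside pattern: read from the cache first, and only on a miss go to the database and populate the cache. A minimal sketch, with plain dicts standing in for a memcached cluster and a backing database:

```python
cache = {}                        # stands in for a memcached cluster
database = {"user:1": "alice"}    # stands in for the backing data store


def get_user(key):
    """Cache-aside read: serve from cache, fall back to the database on a miss."""
    if key in cache:
        return cache[key]         # cache hit: no database round trip
    value = database.get(key)     # the expensive database hit we want to avoid
    if value is not None:
        cache[key] = value        # populate the cache so the next read is cheap
    return value
```

In production the dict would be a real memcached client and the read would carry an expiry time, but the shape of the logic is the same.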

Managing non-transactional data has become even more daunting. From log files to clickstream data to web indexing, Internet data centers are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. One solution that has been deployed by some of the largest web properties (Yahoo, LinkedIn, Facebook, etc.) for massively parallel computation over distributed file systems in a cloud environment is Hadoop (disclosure: Accel is an investor in Cloudera, a company that provides commercial support for Hadoop). In many cases, Hadoop essentially provides an intelligent primary storage and compute layer for the NoSQL databases. Although the framework has roots in Internet data centers, Hadoop is quickly penetrating broader enterprise use cases, as the diverse set of participants at the recent Hadoop World NYC event made clear.
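The map/reduce model Hadoop implements can be illustrated with a pure-Python miniature. This is a sketch only: real Hadoop distributes the map, shuffle and reduce phases across a cluster, and the log format here is made up.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit (status_code, 1) for each log line (hypothetical format)."""
    fields = line.split()
    yield fields[-1], 1           # assume the status code is the last field


def reduce_phase(key, values):
    """Reduce: sum all the counts emitted for one key."""
    return key, sum(values)


log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
]

# Shuffle: group the mapped pairs by key, as the framework would between phases.
groups = defaultdict(list)
for k, v in chain.from_iterable(map_phase(line) for line in log_lines):
    groups[k].append(v)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
# counts == {"200": 2, "404": 1}
```

The point of the framework is that `map_phase` and `reduce_phase` are all the programmer writes; distribution across commodity servers, fault tolerance and the shuffle are handled for you, which is what makes processing log and clickstream data cheap.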

As this cloud stack hardens, new applications and services, previously unthinkable, will come to light in all shapes and sizes. But the one thing they will all have in common is Big Data.

Ping Li is a partner with Accel.


  1. CouchDB should also be added to the list: http://couchdb.apache.org/.

    It also has a pretty cool REST, JSON interface. I’ve only played with it so far but seems to be getting a lot of attention…


    1. Alex – Thanks for bringing it up. Agree that couchDB should be on the list (as well as mongodb, riak and others). There’s a lot of interesting innovation here and couldn’t include them all. All have been optimized for various environments and will be interesting to see which ones emerge in the coming months. I’ll be tracking and updating at: http://twitter.com/ping_accel.

  2. Great thought leadership piece Ping…

  4. Ehmm, no.
    What was the original problem RDBMS were designed to solve?
    The problem at the time was data explosion and data storage cost. Just take a look at the original IBM papers. Also, there was no such thing as a standardized query language then (don’t worry, I know); I just want to make the point that the language helped with the success of RDBMS.
    Anyway, what is the real problem of the 21st century: data access speed, or information processing?
    Now (simplified):
    Information = data in context
    Context = organized data
    Learning = self-organization of data (builds context)

    A really slow human brain, in terms of signal data transfer rate, beats Google any time at extracting information out of a heap of data. Google is only fast at providing data, and if humans had not provided the base organization (links), Google would fail even at that.

    So if the problem of the future is providing information, all the systems you mention are already obsolete, or built to solve the wrong problem. If you solve the information problem, you solve the access speed problem in the process (see above). But speed doesn’t solve the information problem.

  5. Johannes Ernst Sunday, October 25, 2009

    I buy that there’ll be a new stack.

    I don’t buy that its #1 characteristic is “big data.” Why not “big number of users” or “big number of computers” or, in an enterprise context, “big supply chain”?

  6. Daniel Horowitz Sunday, October 25, 2009

    Great piece, you hit the nail right on the head, but you left out an important vendor. 1010data has been providing cloud-based data management solutions for Wall Street, retail, pharma, consumer products, supply chain, etc. for the last 10 years. We provide a web-based analytical DW/BI solution that focuses on solving large analytical problems.

    The NYSE uses us: http://www.a-teamgroup.com/article/nyse-upgrades-1010data-platform-in-support-of-high-performance-market-data-analytics/
    Dollar General uses us – http://www.b-eye-network.com/channels/5084/view/11723/

    and over 80 wall street firms rely on our service daily for ad-hoc and production analysis of mortgage and credit data.

  7. Interesting post.

    I would correct two things:
    1) memcached isn’t necessarily required in front of a DNRDBMS. It depends on the throughput of the database.
    2) The company behind Hadoop is Yahoo!. Facebook is also a large contributor, via Hive. Cloudera contributes a very minor amount of code, compared to Yahoo!.

    1. Two other corrections:
      HBase is from Powerset, which was acquired by Microsoft; HBase is *not* from Yahoo.
      Further, Yahoo is the major contributor to Hadoop; Facebook has also contributed to
      Hadoop. On the other hand, Cloudera is a company trying to monetize Hadoop; Cloudera, so far, has contributed a few minor things, but most of their key technologies (rpm scripts, sbase plugin, their desktop) are *not* put back into the Hadoop Apache open source base.

  8. Great article! It is very relevant to the future of the internet and cloud computing. A fast database is indeed the gateway to faster internet and cloud computing. The database is our biggest bottleneck, and most of our efforts at http://www.binfire.com go into improving its performance.

  10. alexis richardson Monday, October 26, 2009

    This article is spot on. Anyone in this space should also download Ping Li’s paper, which is well worth a read.

    One quote from the paper really nails it. Rich Wolski, CTO and founder of Eucalyptus, says: “pretty much everything you own is going to be trying to send you data”. That’s damn straight. It’s also the gap in Ping’s analysis. Yes we need storage, and yes we need caching. We also need messaging systems that can deliver data to where it is needed, not just ‘within the cloud’, but also possibly across private and public data centers and to client machines. These delivery systems have to be scalable, programmable, secure and reliable. Try RabbitMQ for this.

    Alexis Richardson
    CEO, Rabbit Technologies Ltd.

  11. Looking forward to seeing how this shift unfolds in the enterprise.

  12. Hadoop Truth Squad Tuesday, October 27, 2009

    So any chance there will be a correction to the obviously false statement that Cloudera is the company behind Hadoop? As pointed out, Yahoo! has more than 11 committers on the project, while Cloudera was founded with one and just hired away a second. Moreover, none of the recent graduates Cloudera has hired have contributed any major code to Hadoop. Essentially, Cloudera is a tech support company attempting to trade on the Hadoop name. And while it’s reasonable, as their investor, to puff up their contribution, outright falsehoods should be avoided, right? Cloudera is not the company behind Hadoop. Cloudera is the company positioning itself to make the most money off Hadoop’s potential. Nothing wrong with that, but let’s be accurate. Now, how about a correction to the article?

  13. Absolutely agree with the above comments re: Cloudera being behind Hadoop!

    Everyone knows that is not true, and they haven’t contributed anything major yet compared to what Yahoo and Facebook have been doing. Let’s be real here.

    1. The funny thing is that when Cloudera finally wrote some code (what they call Cloudera Desktop), they didn’t even open source it.

      Cloudera, try to focus less on PR, and more on contributing to the project. It might actually pay off…

      1. Interesting article. Totally agree on Hadoop and Cloudera. Hadoop’s license doesn’t stop anyone from becoming a Cloudera. It really is a tech support company with “smooth deployments” support.

        But I won’t be surprised if Cloudera develops more “closed source” tools to monetize Hadoop. That is where they can succeed.

        I like what they’re doing, although there are some other players emerging on the scene.

  14. Carolyn Pritchard Tuesday, October 27, 2009

    @Hadoop Truth Squad, @ Hadoop User — That error was inserted during the editing process and has been corrected; thanks very much for pointing it out.

    best, Carolyn

    1. Thanks, Carolyn, for updating, and thanks to the Hadoop users for making the clarification. Yahoo and Facebook have indeed been meaningful contributors to Hadoop.

  17. Northscale is not the company behind memcached. They are contributors and supply commercial support.

    1. Totally agree that Northscale is not behind memcached. I saw a tweet by the CEO that they have managed to run memcached as a Windows (Win32) service. Wow, if that’s an achievement, then I don’t know where the real innovation at Northscale is; we have had that running in our grid for the last 6 months, on both 32-bit and 64-bit Windows!
