While no one has entirely sorted out what’s hype and what isn’t when it comes to cloud computing, nor exactly how the enterprise will use it, what is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

Data is everywhere (be it from users, applications or machines) and, as we are propelled into the “Exabyte Era” (PDF), it is growing exponentially; no vertical or industry is being spared. The result is that IT organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it, as cheaply as possible. And so the race to cloud computing has begun.

This isn’t the first time IT architectures have been reinvented in order to remain competitive. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The subsequent shift to web applications/web services during the last decade was enabled by the open networking of applications and services through the Internet buildout. While cloud computing will leverage these prior waves of technology (computing and networking), it will also embrace deep innovations in storage/data management to tackle Big Data.

A Big Data stack
But as with prior data center platform shifts, a new “stack” (like mainframe and OSI) will also need to emerge before cloud computing is broadly embraced by the enterprise. Basic platform capabilities, such as security, access control, application management, virtualization, systems management, provisioning and availability, will have to be standard before IT organizations can adopt the cloud completely. In particular, this new cloud framework needs the ability to process data closer to real time and at ever-greater orders of magnitude, and to do it at a fraction of the typical cost, by leveraging commodity servers for storage and computing. Maybe cloud computing is all about creating a new “Big Data stack.”

In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. At first, many solved this problem by sharding vast MySQL installations, in essence using them more as data stores than true relational databases (no complex table joins, etc.). As Internet data centers scaled, however, sharding MySQL obviously didn’t. Instead, high-performance, scalable, distributed, object-oriented data stores are being developed internally and implemented at scale.
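
For illustration, here is a minimal, self-contained sketch of the sharding pattern (in Python, with plain dicts standing in for the MySQL instances; every name here is hypothetical):

    import hashlib

    # Four dicts stand in for four sharded MySQL instances.
    SHARDS = [dict() for _ in range(4)]

    def shard_for(key):
        # Hash the key and map it to one shard deterministically.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    def put_user(user_id, record):
        # A row lives on exactly one shard, so there are no cross-shard
        # joins: the "database" degenerates into a key/value lookup.
        shard_for(user_id)[user_id] = record

    def get_user(user_id):
        return shard_for(user_id).get(user_id)

    put_user("alice", {"email": "alice@example.com"})
    print(get_user("alice"))

The scaling pain shows up when a shard is added: the modulo changes and most keys map to new homes, one reason this approach eventually stopped keeping up.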

The rise of DNRDBMS
In response, large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS). While it can seem like a different one sprouts up every day, they largely fall into two flavors: distributed key-value stores, such as Dynamo (Amazon) and Voldemort (LinkedIn), and distributed column stores, such as Bigtable (Google), Cassandra (Facebook), HBase (Yahoo/Hadoop) and Hypertable (Zvents).
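
To make the two flavors concrete, here is a toy, single-process sketch; the names are invented, and real systems partition these structures across many machines. A key-value store exposes little more than get/put on opaque values, while a column store keys each cell by row, column family, column and timestamp:

    import time

    # Flavor 1: a key-value store holds opaque values, get/put only.
    kv = {}
    kv["user:42"] = b"opaque blob, interpreted only by the application"

    # Flavor 2: a column store maps row -> family -> column -> [(ts, value)],
    # a sparse, versioned map of maps rather than a fixed schema.
    cells = {}

    def put(row, family, column, value):
        versions = cells.setdefault(row, {}).setdefault(family, {}).setdefault(column, [])
        versions.append((time.time(), value))

    def get_latest(row, family, column):
        versions = cells.get(row, {}).get(family, {}).get(column, [])
        return max(versions)[1] if versions else None

    put("com.example/home", "anchors", "gigaom.com", "Big Data post")
    print(get_latest("com.example/home", "anchors", "gigaom.com"))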

These projects are in various stages of deployment and adoption (it is early days, to be sure), but they promise to deliver a “cloud-scale” data layer on which applications can be built quickly and elastically, while retaining aspects of the reliability and availability of traditional databases. One facet common across this myriad of NoSQL databases is a data caching layer: essentially a high-performance, distributed memory caching system that accelerates web applications by avoiding continual database hits. Memcached (disclosure: Accel is an investor in NorthScale, the company behind Memcached), which sits behind pretty much every Web 2.0 application, has become this de facto layer and is now accepted as a “standard” tier in data centers.
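
As a sketch of how that caching tier is typically used, here is the classic cache-aside pattern with the python-memcached client; the server address is assumed, and load_profile_from_db is a hypothetical stand-in for a real database call:

    import json
    import memcache  # python-memcached client

    mc = memcache.Client(["127.0.0.1:11211"])  # assumed local memcached

    def load_profile_from_db(user_id):
        # Hypothetical stand-in for a slow relational query.
        return {"id": user_id, "name": "example"}

    def fetch_profile(user_id):
        key = "profile:%s" % user_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: no database round trip
        profile = load_profile_from_db(user_id)  # cache miss: hit the database
        mc.set(key, json.dumps(profile), time=300)  # keep for five minutes
        return profile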

Managing non-transactional data has become even more daunting. From log files to clickstream data to web indexing, Internet data centers are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. One solution that has been deployed by some of the largest web properties (Yahoo, LinkedIn, Facebook, etc.) for massively parallel computation and distributed file systems in a cloud environment is Hadoop (disclosure: Accel is an investor in Cloudera, which provides commercial support for Hadoop). In many cases, Hadoop essentially provides an intelligent primary storage and compute layer for the NoSQL databases. Although the framework has roots in Internet data centers, Hadoop is quickly penetrating broader enterprise use cases, as the diverse set of participants at the recent Hadoop World NYC event made clear.
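
To illustrate the programming model, here is a hedged sketch of a Hadoop Streaming job that counts page views from clickstream logs; the log format (tab-separated timestamp, user and URL) and all paths are assumptions:

    # mapper.py -- emit (url, 1) for every clickstream record on stdin.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:  # timestamp, user_id, url
            print("%s\t1" % fields[2])

    # reducer.py -- Hadoop sorts mapper output by key, so identical urls
    # arrive together and a running sum per url suffices.
    import sys

    current_url, count = None, 0
    for line in sys.stdin:
        url, n = line.rstrip("\n").split("\t")
        if url != current_url and current_url is not None:
            print("%s\t%d" % (current_url, count))
            count = 0
        current_url = url
        count += int(n)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

    # Typical invocation (paths assumed):
    #   hadoop jar hadoop-streaming.jar -input /logs/clicks -output /counts \
    #     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

The same two small scripts then run, unchanged, across however many commodity nodes the cluster happens to have.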

As this cloud stack hardens, new applications and services, previously unthinkable, will come to light in all shapes and sizes. But the one thing they will all have in common is Big Data.

Ping Li is a partner with Accel.

Comments
  1. CouchDB should also be added to the list: http://couchdb.apache.org/.

    It also has a pretty cool REST/JSON interface. I’ve only played with it so far, but it seems to be getting a lot of attention…

    Alex

    1. Alex – Thanks for bringing it up. Agree that CouchDB should be on the list (as well as MongoDB, Riak and others). There’s a lot of interesting innovation here and I couldn’t include them all. All have been optimized for various environments, and it will be interesting to see which ones emerge in the coming months. I’ll be tracking and updating at: http://twitter.com/ping_accel.

  2. Great thought leadership piece Ping…

  3. Ehmm, no.
    What was the original problem RDBMSes were designed to solve?
    The problem at the time was data explosion and data storage cost; just take a look at the original IBM papers. There was also nothing back then that could be called a standardized query language (don’t worry, I know); I just want to make the point that the language helped with the success of the RDBMS.
    Anyway, what is the real problem of the 21st century: data access speed, or information processing?
    Now (simplified):
    Information = data in context
    Context = organized data
    Learning = self-organization of data (builds context)

    A really slow human brain, in terms of signal transfer rate, beats Google any time at extracting information out of a heap of data. Google is only fast at providing data, and if humans had not provided the base organization (links), Google would fail even at that.

    So if the problem of the future is providing information, all the systems you mention are already obsolete, or built to solve the wrong problem. IFF you solve the information problem, you solve the access-speed problem in the process (see above). But speed doesn’t solve the information problem.

  4. Johannes Ernst Sunday, October 25, 2009

    I buy that there’ll be a new stack.

    I don’t buy that its #1 characteristic is “big data.” Why not “big number of users” or “big number of computers” or, in an enterprise context, “big supply chain”?

  5. Daniel Horowitz Sunday, October 25, 2009

    Great piece, you hit the nail right on the head, but you left out an important vendor. 1010data has been providing cloud-based data management solutions for Wall Street, retail, pharma, consumer products, supply chain, etc., for the last 10 years. We provide a web-based analytical DW/BI solution that focuses on solving large analytical problems.

    The NYSE uses us – http://www.a-teamgroup.com/article/nyse-upgrades-1010data-platform-in-support-of-high-performance-market-data-analytics/
    Dollar General uses us – http://www.b-eye-network.com/channels/5084/view/11723/

    and over 80 Wall Street firms rely on our service daily for ad-hoc and production analysis of mortgage and credit data.

  6. Interesting post.

    I would correct two things:
    1) memcached isn’t necessarily required in front of a DNRDBMS. It depends on the throughput of the database.
    2) The company behind Hadoop is Yahoo!. Facebook is also a large contributor, via Hive. Cloudera contributes a very minor amount of code, compared to Yahoo!.

    1. Two other corrections:
      HBase is from Powerset (which was acquired by Microsoft); HBase is *not* from Yahoo. Further, Yahoo is the major contributor to Hadoop; Facebook has also contributed to Hadoop. On the other hand, Cloudera is a company trying to monetize Hadoop; so far, Cloudera has contributed a few minor things, but most of their key technologies (rpm scripts, sbase plugin, their desktop) are *not* put back into the Hadoop Apache open-source base.

  7. Great article! It is very relevant to the future of the internet and cloud computing. A fast database is indeed the gateway to a faster internet and cloud computing. Our biggest bottleneck, and most of our effort at http://www.binfire.com, goes into improving database performance.

  8. alexis richardson Monday, October 26, 2009

    This article is ‘spot on’. Anyone in this space should also download Ping Li’s paper, which is well worth a read.

    One quote from the paper really nails it. Rich Wolski, CTO and founder of Eucalyptus, says: “pretty much everything you own is going to be trying to send you data”. That’s damn straight. It’s also the gap in Ping’s analysis. Yes, we need storage, and yes, we need caching. We also need messaging systems that can deliver data to where it is needed, not just ‘within the cloud’, but also possibly across private and public data centers and to client machines. These delivery systems have to be scalable, programmable, secure and reliable. Try RabbitMQ for this.
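
    For instance, a minimal publish with the Python pika client might look like this; the broker address and queue name are just placeholders:

      import pika

      # Connect to an assumed local RabbitMQ broker.
      connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
      channel = connection.channel()
      channel.queue_declare(queue="telemetry", durable=True)

      # Publish a device reading; the consumer may sit in another data center.
      channel.basic_publish(exchange="", routing_key="telemetry",
                            body='{"sensor": 7, "reading": 21.5}')
      connection.close()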

    Alexis Richardson
    CEO, Rabbit Technologies Ltd.
