The Future Is Big Data in the Cloud

When it comes to cloud computing, no one has entirely sorted out what’s hype and what isn’t, or exactly how the enterprise will use it. What is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

Data is everywhere (be it from users, applications or machines) and, as we are propelled into the “Exabyte Era,” it is growing exponentially; no vertical or industry is being spared. The result is that IT organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it, as cheaply as possible. And so the race to cloud computing has begun.

This isn’t the first time IT architectures have been reinvented in order to remain competitive. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The subsequent shift to web applications/web services during the last decade was enabled by the open networking of applications and services through the Internet buildout. While cloud computing will leverage these prior waves of technology (computing and networking), it will also embrace deep innovations in storage/data management to tackle Big Data.

A Big Data stack
But as with prior data center platform shifts, a new “stack” (like mainframe and OSI) will also need to emerge before cloud computing is broadly embraced by the enterprise. Basic platform capabilities, such as security, access control, application management, virtualization, systems management, provisioning and availability, will have to be standard before IT organizations are able to adopt the cloud completely. In particular, this new cloud framework needs the ability to process data ever closer to real time and at volumes orders of magnitude greater, and to do it at a fraction of what it would typically cost, by leveraging commodity servers for storage and computing. Maybe cloud computing is all about creating a new “Big Data stack.”

In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. In their place, high-performance, scalable/distributed, object-oriented data stores are being developed internally and implemented at scale. At first, many solved this problem by sharding vast MySQL instances, in essence using them more as data stores than true relational databases (no complex table joins, etc.). As Internet data centers scaled, however, sharding MySQL obviously didn’t.
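To make the sharding idea concrete, here is a minimal sketch in Python, with plain dictionaries standing in for the MySQL instances. The names `shard_for`, `ShardedStore` and `moved_fraction` are hypothetical and not taken from any real system. The last function illustrates one commonly cited pain point of naive hash sharding, which is an assumption on my part rather than something spelled out above: adding a shard changes where most keys live.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Hypothetical stand-in: each dict plays the role of one database instance."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        # Lookups are by key only; there is no cross-shard join.
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)
```

Calling `moved_fraction` on a large set of keys for a change from four shards to five should show that the majority of keys land on a different shard, which in a real deployment would mean a large data migration every time capacity is added.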

The rise of DNRDBMS
In response to this, large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS). While it can seem like a different version sprouts up every day, they can largely be categorized into two flavors: one, distributed key-value stores, such as Dynamo (Amazon) and Voldemort (LinkedIn); and two, distributed column stores, such as Bigtable (Google), Cassandra (Facebook), HBase (Yahoo/Hadoop) and Hypertable (Zvents).
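The difference between the two flavors is easiest to see in the shape of the data. The sketch below is a toy, single-process illustration and not the API of any of the systems named above; `KeyValueStore` and `ColumnStore` are hypothetical classes, and the `family:qualifier` column naming is only an assumed convention.

```python
from collections import defaultdict


class KeyValueStore:
    """Key-value flavor: the store treats each value as an opaque blob."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


class ColumnStore:
    """Column flavor: each row key maps to named columns that can be read selectively."""

    def __init__(self) -> None:
        self._rows = defaultdict(dict)

    def put(self, row_key: str, column: str, value) -> None:
        self._rows[row_key][column] = value

    def get(self, row_key: str, columns=None) -> dict:
        row = self._rows.get(row_key, {})
        if columns is None:
            return dict(row)
        # Return only the requested columns that exist for this row.
        return {c: row[c] for c in columns if c in row}
```

In the first case the application has to fetch and decode the whole value to read one field; in the second it can ask for a single column of a row.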

These projects are in various stages of deployment and adoption (it is early days, to be sure), but promise to deliver a “cloud-scale” data layer on which applications can be built quickly and elastically, all while retaining aspects of the reliability/availability of traditional databases. One facet that is common across these myriad NoSQL databases is a data caching layer, essentially a high-performance, distributed memory caching system that can accelerate web applications by avoiding continual database hits. Memcached (disclosure: Accel is an investor in Northscale, parent company of Memcached), which is behind pretty much every Web 2.0 application, has become this de facto layer thanks to its broad distribution and is now accepted as a “standard” tier in data centers.
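The caching pattern described here is often called cache-aside. A minimal sketch follows, assuming an in-process dictionary in place of a real distributed cache; the class `CacheAside` and its `load_from_db` callback are hypothetical names, and this is not Memcached's client API.

```python
class CacheAside:
    """Check the cache first; fall back to the database only on a miss."""

    def __init__(self, load_from_db) -> None:
        self._cache = {}            # stand-in for a distributed memory cache
        self._load = load_from_db   # hypothetical callback that queries the database
        self.db_hits = 0            # counts how often the database was actually consulted

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.db_hits += 1
        value = self._load(key)
        self._cache[key] = value    # populate the cache for later readers
        return value

    def invalidate(self, key) -> None:
        # Call after a write so the next read goes back to the database.
        self._cache.pop(key, None)
```

Repeated reads of the same key reach the database only once until the entry is invalidated, which is the effect the caching tier is meant to deliver.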

Managing non-transactional data has become even more daunting. From log files to clickstream data to web indexing, Internet data centers are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. One solution that has been deployed by some of the largest web properties (Yahoo, LinkedIn, Facebook, etc.) for massive parallel computation and distributed file systems in a cloud environment is Hadoop (disclosure: Accel is an investor in Cloudera, the company that provides commercial support for Hadoop). In many cases, Hadoop essentially provides an intelligent primary storage and compute layer for the NoSQL databases. Although the framework has roots in Internet data centers, Hadoop is quickly penetrating broader enterprise use cases, as the diverse set of participants at the recent Hadoop World NYC event made clear.
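The style of parallel computation associated with Hadoop is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process to count page views in clickstream lines. It is not Hadoop's API; the function names and the assumed "user url" line format are hypothetical.

```python
from collections import defaultdict


def map_phase(log_line: str):
    """Emit (url, 1) for each well-formed line; assumes 'user url' per line."""
    parts = log_line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce calls would run on many commodity machines against blocks of a distributed file system; the logic per record stays this simple, which is much of the appeal.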

As this cloud stack hardens, new applications and services that were previously unthinkable will come to light, in all shapes and sizes. But the one thing they will all have in common is Big Data.

Ping Li is a partner with Accel.