The Future Is Big Data in the Cloud


While no one has entirely sorted out what’s hype and what isn’t in cloud computing, nor exactly how the enterprise will use it, it is becoming increasingly clear that Big Data is the future of IT. Tackling Big Data will therefore determine the winners and losers in the next wave of cloud computing innovation.

Data is everywhere (be it from users, applications or machines) and, as we are propelled into the “Exabyte Era,” is growing exponentially; no vertical or industry is being spared. The result is that IT organizations everywhere are being forced to grapple with storing, managing and extracting value from every piece of it, as cheaply as possible. And so the race to cloud computing has begun.

This isn’t the first time IT architectures have been reinvented in order to remain competitive. The shift from mainframe to client-server was fueled by disruptive innovation in computing horsepower that enabled distributed microprocessing environments. The subsequent shift to web applications/web services during the last decade was enabled by the open networking of applications and services through the Internet buildout. While cloud computing will leverage these prior waves of technology (computing and networking), it will also embrace deep innovations in storage/data management to tackle Big Data.

A Big Data stack
But as with prior data center platform shifts, a new “stack” (like the mainframe and OSI stacks) will need to emerge before cloud computing is broadly embraced by the enterprise. Basic platform capabilities such as security, access control, application management, virtualization, systems management, provisioning and availability will have to be standard before IT organizations are able to adopt the cloud completely. In particular, this new cloud framework needs the ability to process data ever closer to real time and at volumes orders of magnitude greater, and to do it at a fraction of what it would typically cost, by leveraging commodity servers for storage and computing. Maybe cloud computing is all about creating a new “Big Data stack.”

In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. At first, many solved this problem by sharding vast MySQL instances, in essence using them more as data stores than true relational databases (no complex table joins, etc.). As Internet data centers scaled, however, sharded MySQL did not scale with them. In its place, high-performance, scalable, distributed, object-oriented data stores are being developed internally and implemented at scale.
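
One way to see why naive sharding becomes painful as a cluster grows is the minimal sketch below. It assumes keys are routed to shards by hashing modulo the shard count, which is an illustrative assumption and not a description of how any particular site sharded MySQL. The function names `shard_for` and `moved_fraction` are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    md5 is used only because it is deterministic across processes,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10_000)]
    fraction = moved_fraction(keys, old_shards=4, new_shards=5)
    print(f"{fraction:.0%} of keys change shard when growing from 4 to 5 shards")
```

Under this scheme, adding a single shard reassigns the large majority of keys (roughly four in five when going from 4 to 5 shards), so every expansion means moving most of the data. That kind of operational cost is one plausible reason purpose-built distributed stores became attractive.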

The rise of DNRDBMS
In response to this, large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS). While it can seem like a different one sprouts up every day, they can largely be categorized into two flavors: one, distributed key-value stores, such as Dynamo (Amazon) and Voldemort (LinkedIn); and two, distributed column stores such as Bigtable (Google), Cassandra (Facebook), HBase (Yahoo/Hadoop) and Hypertable (Zvents).
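
The difference between the two flavors is easiest to see in how data is addressed. The sketch below is a toy, in-memory illustration only: the class names `KeyValueStore` and `ColumnStore` are hypothetical, and it leaves out everything that makes the real systems distributed (partitioning, replication, versioning, timestamps).

```python
from collections import defaultdict


class KeyValueStore:
    """Opaque values addressed by a single key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class ColumnStore:
    """Rows addressed by key; each row groups named columns into families."""

    def __init__(self):
        # row_key -> column family -> column name -> value
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        # .get() avoids creating empty rows on reads
        return self._rows.get(row_key, {}).get(family, {}).get(column)

    def get_family(self, row_key, family):
        """Return a copy of every column in one family of one row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


if __name__ == "__main__":
    kv = KeyValueStore()
    kv.put("user:42", '{"name": "Ada", "city": "London"}')  # whole record is one blob
    print(kv.get("user:42"))

    cs = ColumnStore()
    cs.put("user:42", "profile", "name", "Ada")
    cs.put("user:42", "profile", "city", "London")
    print(cs.get("user:42", "profile", "city"))  # read a single column
```

In the key-value model the application reads and writes the whole value; in the column model it can read or update individual columns of a row without touching the rest.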

These projects are in various stages of deployment and adoption (it is early days, to be sure), but promise to deliver a “cloud-scale” data layer on which applications can be built quickly and elastically, all while having aspects of the reliability/availability of traditional databases. One facet common across this myriad of NoSQL databases is a data caching layer, essentially a high-performance, distributed memory caching system that can accelerate web applications by avoiding continual database hits. Memcached (disclosure: Accel is an investor in Northscale, parent company of Memcached), which is behind pretty much every Web 2.0 application, has through its broad distribution become this de facto layer and is now accepted as a “standard” tier in data centers.
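
The way such a caching tier avoids continual database hits can be sketched with the common cache-aside pattern. This is an assumption-laden illustration, not Memcached’s API: a plain dictionary stands in for the distributed cache, and `CacheAside`, `load_from_db` and `db_hits` are hypothetical names.

```python
import time


class CacheAside:
    """Check the cache first; fall back to the database only on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # function: key -> value (the slow path)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._cache = {}               # key -> (expires_at, value)
        self.db_hits = 0               # how often the slow path was taken

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # cache hit: no database work
        self.db_hits += 1              # miss or expired: go to the database
        value = self._load(key)
        self._cache[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying row is updated."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    def slow_lookup(key):
        return f"row for {key}"        # stand-in for a real database query

    cache = CacheAside(slow_lookup, ttl_seconds=30)
    for _ in range(1000):
        cache.get("user:42")
    print(f"1000 reads, {cache.db_hits} database hit(s)")
```

Repeated reads of a hot key reach the database once per expiry window instead of once per request, which is the effect the caching tier is there to provide.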

Managing non-transactional data has become even more daunting. From log files to clickstream data to web indexing, Internet data centers are collecting massive volumes of data that need to be processed cheaply in order to drive monetization value. One solution that has been deployed by some of the largest web properties (Yahoo, LinkedIn, Facebook, etc.) for massive parallel computation and distributed file systems in a cloud environment is Hadoop (disclosure: Accel is an investor in Cloudera, a company that provides commercial support for Hadoop). In many cases, Hadoop essentially provides an intelligent primary storage and compute layer for the NoSQL databases. Although the framework has roots in Internet data centers, Hadoop is quickly penetrating broader enterprise use cases, as the diverse set of participants at the recent Hadoop World NYC event made clear.
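
The style of parallel computation involved here is the map/shuffle/reduce model. The sketch below runs that model in a single Python process over a few made-up clickstream lines; it does not use Hadoop’s API, and the log format (`user_id url`) and function names are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (url, 1) pair for every well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:            # skip blank or malformed lines
            yield parts[1], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_hits(lines):
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    log = [
        "u1 /home",
        "u2 /home",
        "u1 /pricing",
        "",                            # malformed line, ignored
        "u3 /home",
    ]
    print(count_hits(log))
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across many commodity machines, which is what makes the model a fit for cheap processing of very large logs.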

As this cloud stack hardens, new applications and services, previously unthinkable, will come to light in all shapes and sizes. But the one thing they will all have in common is Big Data.

Ping Li is a partner with Accel.


Shahbaz Ali

Totally agree that Northscale is not behind memcached. I saw a tweet by the CEO that they have managed to run memcached as a Windows (Win32) service. WOW, if that’s an achievement then I don’t know where the real innovation is in Northscale – we have had that running in our Grid for the last 6 months – both 32-bit and 64-bit Windows!!!

Carolyn Pritchard

@Hadoop Truth Squad, @ Hadoop User — That error was inserted during the editing process and has been corrected; thanks very much for pointing it out.

best, Carolyn


Thanks Carolyn for updating. Thanks to Hadoop users for making the clarification. Yahoo and Facebook have indeed been meaningful contributors to Hadoop.

Hadoop User

Absolutely agree with above comments re Cloudera behind Hadoop !

Everyone knows that is not true and they haven’t contributed anything major yet compared to what Yahoo and Facebook have been doing. Let’s be real here.

Another Hadoop User

The funny thing is that when Cloudera finally wrote some code (what they call Cloudera Desktop), they didn’t even open source it.

Cloudera, try to focus less on PR, and more on contributing to the project. It might actually pay off…

Shahbaz Ali

Interesting article. Totally agree on Hadoop and Cloudera. Hadoop’s license doesn’t stop anyone from becoming a Cloudera. It really is a tech support company with “smooth deployments” support.

But I won’t be surprised if Cloudera develops more “closed source” tools to monetize Hadoop, which is where they can succeed.

I like what they are doing, although there are some other players emerging on the scene.

Hadoop Truth Squad

So any chance there will be a correction as to the obviously false statement that Cloudera is the company behind Hadoop? As pointed out, Yahoo! has more than 11 committers on the project, while Cloudera was founded with one and just hired away a second. Moreover, none of the recent graduates Cloudera has hired have contributed any major code to Hadoop. Essentially Cloudera is a tech support company attempting to steal the name Hadoop. And while it’s reasonable, as their investor, to puff up their contribution, outright falsehoods should be avoided, right? Cloudera is not the company behind Hadoop. Cloudera is the company positioning itself to make the most money off Hadoop’s potential. Nothing wrong with that, but let’s be accurate. Now, how about a correction to the article?

alexis richardson

This article is ‘spot on’. Anyone in this space should also download Ping Li’s paper which is well worth a read.

One quote from the paper really nails it. Rich Wolski, CTO and founder of Eucalyptus, says: “pretty much everything you own is going to be trying to send you data”. That’s damn straight. It’s also the gap in Ping’s analysis. Yes we need storage, and yes we need caching. We also need messaging systems that can deliver data to where it is needed, not just ‘within the cloud’, but also possibly across private and public data centers and to client machines. These delivery systems have to be scalable, programmable, secure and reliable. Try RabbitMQ for this.

Alexis Richardson
CEO, Rabbit Technologies Ltd.

David Robins

Great article! It is very relevant to the future of the internet and cloud computing. A fast database is indeed the gateway to a faster internet and faster cloud computing. The database is our biggest bottleneck, and most of our effort goes into improving its performance.

Tim Gray

Interesting post.

I would correct two things:
1) memcached isn’t necessarily required in front of a DNRDBMS. It depends on the throughput of the database.
2) The company behind Hadoop is Yahoo!. Facebook is also a large contributor, via Hive. Cloudera contributes a very minor amount of code, compared to Yahoo!.


Two other corrections:
HBase is from Powerset, acquired by Microsoft; HBase is *not* from Yahoo.
Further, Yahoo is the major contributor to Hadoop; Facebook has also contributed to Hadoop. On the other hand, Cloudera is a company trying to monetize Hadoop; Cloudera, so far, has contributed a few minor things, but most of their key technologies (rpm scripts, sbase plugin, their desktop) are *not* put back into the Hadoop Apache open source base.

Daniel Horowitz

Great piece, you hit the nail right on the head, but you left out an important vendor. 1010data has been providing cloud-based data management solutions for Wall Street, retail, pharma, consumer products, supply chain, etc. for the last 10 years. We provide a web-based analytical DW/BI solution that focuses on solving large analytical problems.

The NYSE uses us.
Dollar General uses us.

and over 80 wall street firms rely on our service daily for ad-hoc and production analysis of mortgage and credit data.

Johannes Ernst

I buy that there’ll be a new stack.

I don’t buy that its #1 characteristic is “big data”. Why not “big number of users” or “big number of computers” or, in an enterprise context, “big supply chain”?


Ehmm, no.
What was the original problem RDBMSs were designed to solve?
The problem at the time was data explosion and data storage cost. Just take a look at the original IBM papers. Also, there was nothing that could be called a standardized query language (don’t worry, I know). I just want to make the point that the language helped with the success of the RDBMS.
Anyway, what is the real problem of the 21st century: data access speed or information processing?
Now (simplified):
Information = data in context
Context = organized data
learning = self organization of data (builds context)

A really slow human brain (in terms of signal transfer rate) beats Google any time at extracting information out of a heap of data. Google is only fast at providing data, and if humans had not provided the base organization (links), Google would fail even at that.

So if the problem of the future is providing information, all the systems you mention are already obsolete or built to solve the wrong problem. IFF you solve the information problem, you solve the access speed problem in the process (see above). But speed doesn’t solve the information problem.


CouchDB should also be added to the list:

It also has a pretty cool REST, JSON interface. I’ve only played with it so far but seems to be getting a lot of attention…


Ping Li

Alex – Thanks for bringing it up. Agree that CouchDB should be on the list (as well as MongoDB, Riak and others). There’s a lot of interesting innovation here and I couldn’t include them all. All have been optimized for various environments, and it will be interesting to see which ones emerge in the coming months. I’ll be tracking and updating at:
