
Summary:

Is Hadoop our only hope for solving big data challenges? From scalability to fault tolerance, Hadoop does myriad things very well. Yet, Hadoop is not the solution to all big data problems and use cases. Several key issues remain, including investment, complexity and batch-only processing.


There has been an absolute explosion in the data space recently. Devices, consumers and companies are not only producing data at a stellar pace, but also introducing amazing complexity into the data itself. It is a rare day without a freshly funded company, a new product launch or a front-page acquisition around data to help address these issues. The scale and sophistication required of even a small startup are growing, along with rising expectations for “real-time” answers.

The good news is that along with these great challenges, we are also seeing an explosion in innovative tools and techniques. One of the most renowned is Hadoop, Apache’s open-source implementation of the MapReduce model Google described, which makes data-parallel processing efficient, affordable and approachable even for users without enormous engineering teams.
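To make the programming model concrete, here is a minimal, single-process sketch of the map/shuffle/reduce pattern that Hadoop distributes across a cluster. It is a toy word count with illustrative names, not Hadoop's actual API.

```python
# Toy illustration of the MapReduce programming model: a word count.
# Single-process sketch of the paradigm, not Hadoop's Java API.
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit an intermediate (key, value) pair for every word seen.
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Combine all values that share a key into a single result.
    return word, sum(counts)

def run_job(documents):
    # The "shuffle": group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    docs = {1: "big data is big", 2: "data about data"}
    print(run_job(docs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Hadoop's value is that it runs the map and reduce steps of exactly this shape across many machines, handling partitioning, scheduling and failure recovery for you.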

But is Hadoop our only hope? From scalability to fault tolerance, Hadoop does myriad things very well. Yet, Hadoop is not the solution to all big data problems and use cases. Several key issues remain:

Investment.  Even with the explosion of tools, support and services surrounding Hadoop, getting Hadoop deployed in production requires a significant investment in terms of training users and tuning clusters and workflows for optimal usage. Hadoop’s complexity opens the door for support and services organizations such as Cloudera, as well as interface plays like Datameer and its spreadsheet-based user interface. Also, it’s surprising that the industry hasn’t seen far more growth around hosted Hadoop solutions like Amazon’s Elastic MapReduce, which directly targets these capex and opex issues.

Data Complexity. MapReduce is an excellent paradigm for very efficient parallel execution of many analytic algorithms, but it isn’t one-size-fits-all. As the issue of data complexity continues to grow, some problems are simply ill-suited for MapReduce. In particular, the very same social graphs that are driving a large sector of the big data problem are poorly matched to the underlying design choices of MapReduce. Although brute force techniques can be applied, they ultimately fail by leading to inefficiencies that can significantly impact a business’s bottom line.
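As a rough illustration of that mismatch, consider a friends-of-friends query on a social graph: forced into MapReduce, every hop becomes a full job that scans the entire edge list, even though only a small frontier of the graph matters. The sketch below is a toy, single-process rendering of that brute-force shape, with made-up data and names.

```python
# Toy sketch: k-hop breadth-first expansion written in MapReduce style.
# Each hop is one full "job" that scans every edge in the graph, which is
# the brute-force shape such queries take when forced into MapReduce.
from collections import defaultdict

EDGES = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("eve", "frank")]

def map_edges(reached, edge):
    # Emit the destination if the source has already been reached.
    src, dst = edge
    if src in reached:
        yield dst, True

def one_hop(reached):
    # One "job": map over *all* edges, then reduce by unioning the keys.
    grouped = defaultdict(list)
    for edge in EDGES:                      # full scan of the edge list
        for key, value in map_edges(reached, edge):
            grouped[key].append(value)
    return reached | set(grouped)

def k_hop(start, k):
    reached = {start}
    for _ in range(k):                      # k hops cost k full scans
        reached = one_hop(reached)
    return reached

if __name__ == "__main__":
    print(k_hop("alice", 2))  # {'alice', 'bob', 'carol'}
```

Systems built for graphs avoid these repeated whole-graph scans by keeping adjacency information indexed and touching only the frontier, which is why the brute-force MapReduce version becomes so expensive at social-graph scale.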

Batch. I find this issue the most compelling. Big data, for better or worse, is still generally confined to the data warehouse. That largely means “offline” data that is subject to the classic extract-transform-load (ETL) workflow. Hadoop helps minimize the turnaround time for ETL, but batch processing still means something more akin to “tomorrow” than “real-time.” In contrast, Google’s original MapReduce paper gave an inspiring example of inserting MapReduce directly into a sequential C++ program for efficient real-time computation. Unfortunately, we find far too few real-time examples in production.
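For contrast, here is a minimal sketch of what applying the map/reduce pattern inside a running program can look like: the map step is spread across local worker processes and the reduce step folds their partial results, so an answer comes back in the request path rather than from a nightly batch job. It is a toy built on Python's standard library, not the C++ example from the paper.

```python
# Toy sketch: the map/reduce pattern applied inside a running program,
# so a result comes back immediately instead of from a nightly batch job.
from functools import reduce
from multiprocessing import Pool

def map_chunk(events):
    # Map step: partial aggregate (count, total) for one chunk of events.
    return len(events), sum(e["latency_ms"] for e in events)

def reduce_pair(a, b):
    # Reduce step: combine two partial aggregates.
    return a[0] + b[0], a[1] + b[1]

def mean_latency(events, workers=4):
    chunks = [events[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)    # parallel map
    count, total = reduce(reduce_pair, partials)  # sequential reduce
    return total / count if count else 0.0

if __name__ == "__main__":
    events = [{"latency_ms": i % 50} for i in range(10_000)]
    print(mean_latency(events))
```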

As a big data solution provider myself, I think we should set our sights much, much higher than simply handling more data in the warehouse. Instead, we should integrate the concepts of scalability, fault tolerance and efficient parallel computation into the very systems that drive the end-user experience. Many of the players in the nascent NoSQL database market actually aim to do precisely that, but those tools come with their own issues and effects that could fill an entirely separate post.

The emergence of Hadoop has gone a very long way toward democratizing big data, but big data spans across important issues that Hadoop alone doesn’t address. While large engineering shops such as Facebook are able to deal with these challenges, the intellectual and capital expenditures required for success reinforce that our revolution is still far from complete. As the market matures, I expect big data’s near future includes technological and market responses to address these increasingly important issues.

Mike Miller is co-founder and chief scientist at Cloudant.

Image courtesy of Krzysztof Poltorak.


  1. How can developing countries get the same thing? Data storage capacity is always an issue there, and the cost of connectivity makes it something of a rare commodity. There is no balance between the economics of those markets and the rest of the world.

    Large manufacturers will still want to find what they are looking for in developing countries.

  2. Hadoop is great, but there are a lot of other solutions that work well based on your exact requirements. Hadoop was too bulky for us, so we ended up using CouchDB for our large data storage.

  3. Vladimir Rodionov Sunday, June 12, 2011

    Big Data has nothing to do with democracy. Big Data belongs to big corporations, which are the total opposite of democracy itself.

    1. Agree with Vladimir. While processing Big Data might become cheaper, acquiring it will be the real cost.

  4. I think we are nearing a tipping point when it comes to “Big Data” and next-gen data processing. The good news is that we are seeing tools hit the market that can help us finally start leveraging all of that data we’ve been piling up in our organizations. Data volume is becoming less and less of a scary thing.

    Hadoop has been popular as a “first to market” MapReduce tool – but it has simply proven that massively parallel processing doesn’t have to be a hugely complex endeavor.

    Many more methodologies and tools will be popping up; all with different approaches to MapReduce, and some will be right for certain use cases.

    Even at my company Basho Technologies we have just today released into beta a new take on MapReduce called “Riak Pipe” – if you’re looking for a new way to approach MapReduce, check this out:

    http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-June/004550.html

    …just one more step towards further democratizing big data…

  5. Mike,

    Good article. Your point is well taken about more real-time leverage of Big Data, as well as the statement about Facebook-sized capital expenditures. I applaud Cloudant for focusing on cloud-based CouchDB hosting. In addition to batch processing and real-time access, I would like to add that certain use cases are also about the ability to retain data at scale. This is a compliance requirement, where access may be occasional, so the TCO of the data is the primary concern. In this instance, significant data reduction or compression is needed to make keeping the data on premise (or in the cloud) feasible, since uncompressed data transmission rates would not support the petabytes being generated.
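To give a rough sense of the data reduction the last comment refers to, here is a minimal sketch that compresses a block of repetitive, log-style records with gzip and reports the ratio. The sample data is made up, and real ratios depend entirely on the data being retained.

```python
# Rough sketch: measure the compression ratio of repetitive, log-style
# records with gzip. The sample data and resulting ratio are illustrative.
import gzip

records = b"".join(
    b"2011-06-12T12:00:%02dZ user=%d action=view page=/home\n" % (i % 60, i)
    for i in range(100_000)
)
compressed = gzip.compress(records)
print(f"raw: {len(records):,} bytes, gzip: {len(compressed):,} bytes, "
      f"ratio: {len(records) / len(compressed):.1f}x")
```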
