There has been an absolute explosion in the data space recently. Devices, consumers and companies are not only producing data at a stellar pace, but also introducing amazing complexity into the data itself. It is a rare day without a freshly funded company, a new product launch or a front-page acquisition around data to help address these issues. The scale and sophistication required of even a small startup are rising quickly, along with expectations for “real-time” answers.
The good news is that along with these great challenges, we are also seeing an explosion in innovative tools and techniques. One of the most renowned is Hadoop, Apache’s open source implementation of Google’s MapReduce programming model, which makes data-parallel processing efficient, affordable and approachable even for users without enormous engineering teams.
But is Hadoop our only hope? From scalability to fault tolerance, Hadoop does myriad things very well. Yet, Hadoop is not the solution to all big data problems and use cases. Several key issues remain:
Investment. Even with the explosion of tools, support and services surrounding Hadoop, getting Hadoop deployed in production requires a significant investment in terms of training users and tuning clusters and workflows for optimal usage. Hadoop’s complexity opens the door for support and services organizations such as Cloudera, as well as interface plays like Datameer and its spreadsheet-based user interface. Also, it’s surprising that the industry hasn’t seen far more growth around hosted Hadoop solutions like Amazon’s Elastic MapReduce, which directly targets these capex and opex issues.
Data Complexity. MapReduce is an excellent paradigm for very efficient parallel execution of many analytic algorithms, but it isn’t one-size-fits-all. As data complexity continues to grow, some problems are simply ill-suited for MapReduce. In particular, the very same social graphs that are driving a large sector of the big data problem are poorly matched to the underlying design choices of MapReduce. Brute force techniques can be applied, but they introduce inefficiencies that can significantly impact a business’s bottom line.
Batch. I find this issue the most compelling. Big data, for better or worse, is still generally confined to the data warehouse. That largely means “offline” data that is subject to the classic extract-transform-load (ETL) workflow. Hadoop helps minimize the turnaround time for ETL, but batch processing still means something more akin to “tomorrow” than “real-time.” In contrast, Google’s original MapReduce paper gave an inspiring example of inserting MapReduce directly into a sequential C++ program for efficient real-time computation. Unfortunately, we find far too few real-time examples in production.
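To make the data complexity point concrete, here is a minimal in-memory sketch of breadth-first search written in a MapReduce style. It is not Hadoop code, and the function names (`map_phase`, `reduce_phase`, `run_round`, `bfs_rounds`) and the toy graph are illustrative assumptions. What it shows is the brute force pattern: every additional hop requires another full map, shuffle and reduce pass over the entire graph, plus one extra pass just to confirm that nothing changed.

```python
from collections import defaultdict

INF = float("inf")  # distance for nodes not reached yet


def map_phase(node, record):
    """Re-emit the node's own state and offer dist + 1 to each neighbor."""
    dist, neighbors = record
    yield node, ("graph", neighbors)  # carry the structure into the next round
    yield node, ("dist", dist)        # carry the current best distance
    if dist != INF:
        for neighbor in neighbors:
            yield neighbor, ("dist", dist + 1)


def reduce_phase(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    neighbors, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            best = min(best, payload)
    return best, neighbors


def run_round(graph):
    """One full map -> shuffle -> reduce pass over every node in the graph."""
    shuffled = defaultdict(list)
    for node, record in graph.items():
        for key, value in map_phase(node, record):
            shuffled[key].append(value)
    return {node: reduce_phase(node, values) for node, values in shuffled.items()}


def bfs_rounds(adjacency, source):
    """Repeat full passes until no distance changes; return distances and pass count."""
    graph = {
        node: (0 if node == source else INF, list(neighbors))
        for node, neighbors in adjacency.items()
    }
    rounds = 0
    while True:
        updated = run_round(graph)
        rounds += 1
        if updated == graph:  # converged: this pass changed nothing
            return {node: dist for node, (dist, _) in graph.items()}, rounds
        graph = updated


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    print(bfs_rounds(chain, "a"))
```

On the four-node chain above, the sketch needs four full passes: three to push the distance one hop at a time and a fourth to detect convergence. On a real cluster each pass would be a separate job that rereads and rewrites the whole graph, which is where the inefficiency comes from.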
As a big data solution provider myself, I think we should set our sights much, much higher than simply handling more data in the warehouse. Instead, we should integrate the concepts of scalability, fault tolerance and efficient parallel computation into the very systems that drive the end-user experience. Many of the players in the nascent NoSQL database market actually aim to do precisely that, but those tools come with their own issues and effects that could fill an entirely separate post.
The emergence of Hadoop has gone a very long way toward democratizing big data, but big data spans important issues that Hadoop alone doesn’t address. While large engineering shops such as Facebook are able to deal with these challenges, the intellectual and capital expenditures required for success are a reminder that our revolution is still far from complete. As the market matures, I expect big data’s near future to include technological and market responses to these increasingly important issues.
Mike Miller is co-founder and chief scientist at Cloudant.
Image courtesy of Krzysztof Poltorak.