Although the first couple of years of the commercial Hadoop era have been characterized by an attitude of “Hadoop is great, but …”, the tone is changing as Hadoop vendors make the platform more palatable with each iteration. No longer is a Hadoop job necessarily an epic undertaking rife with pitfalls; if you know how to leverage it, at least, Hadoop is becoming as reliable as any other data-management tool on the market.
This sort of evolution was to be expected, of course, because Hadoop is an open-source Apache Software Foundation project whose commercialization is still in its infancy. Early adopters within Yahoo, Facebook and other web properties had the time and software-engineering skills to work around its performance, scalability and management shortcomings, but many mainstream organizations did not. With the advent of Cloudera, Hortonworks, MapR and any number of other Hadoop-based vendors, however, things began to change.
Hadoop is now compatible with just about every other piece of data-management software, so companies can integrate their data environments (SQL, Hadoop, NoSQL, integration tools and so on) without having to build those connections themselves. Management has been made easier thanks to tools from numerous software vendors, and Hadoop clusters can even be rolled out as part of preconfigured hardware appliances. Organizations hoping to get the most out of Hadoop still need the skills to write MapReduce jobs, but even that’s getting easier thanks to a variety of products designed to simplify that process, some of which are based entirely in the cloud.
Two recent product announcements further illustrate Hadoop’s increased maturity: Rainstor’s new Big Data Analytics on Hadoop and version 2.0 of the Hortonworks Data Platform. Rainstor’s product is a database that stores data in the Hadoop Distributed File System but lets users query it with either MapReduce or SQL, depending on which is the right tool for the job. It also compresses data volumes by up to 90 percent, meaning users can run smaller Hadoop clusters and faster queries.
Rainstor CEO John Bantleman told me during a recent interview he thinks “there will be an aversion to spending $50 million to solve the petabyte problem.” Hadoop is the answer, but, he added, Rainstor’s large customers in banking, media and telecommunications require more from Hadoop than what they can get from an open-source project.
Below the database level where Rainstor resides, the latest version of the Hortonworks Data Platform makes the Hadoop cluster itself a more reliable and more flexible system. It’s based on the 0.23 release of Apache Hadoop, which is a landmark release because it integrates solutions to both scalability and availability woes within HDFS, as well as next-generation MapReduce.
Next-generation MapReduce is arguably one of the more important developments in big data as a whole because it lets Hadoop natively plug into alternative processing engines. While MapReduce is great for batch processing, it’s not great for applications that need results closer to real time, and it still requires writing MapReduce jobs. HDFS, on the other hand, has proven effective as a scalable data store that can underpin alternative products. These could range from database projects such as Hive, HBase and even Rainstor’s new product, to engines that can process data as it streams into the system.
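To give a rough sense of what “writing MapReduce jobs” means compared with writing SQL, here is a minimal sketch in plain Python. It is not Hadoop code and not any vendor’s API: the map, shuffle and reduce steps are simulated in memory, SQLite stands in for a SQL engine, and the `visits` table, its columns and the sample records are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor, page) records.
SAMPLE_VISITS = [("alice", "/home"), ("bob", "/home"), ("alice", "/pricing")]


def map_phase(records):
    """Emit a (key, 1) pair for every record, as a mapper would."""
    for visitor, _page in records:
        yield visitor, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_with_mapreduce(records):
    """Count visits per visitor using the simulated map/shuffle/reduce steps."""
    return reduce_phase(shuffle(map_phase(records)))


def count_with_sql(records):
    """Count visits per visitor with a single declarative SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT visitor, COUNT(*) FROM visits GROUP BY visitor"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(count_with_mapreduce(SAMPLE_VISITS))
    print(count_with_sql(SAMPLE_VISITS))
```

Both functions answer the same question. The difference is how much of the processing logic the author has to spell out by hand, which is the gap that SQL-on-Hadoop products aim to close.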
I’d be a fool to say Hadoop is a finished product and that no more work is necessary, but it does look like most of the major wrinkles will be ironed out thanks to the current wave of innovation. Even if it ends up lasting a decade, we’re at the beginning of Hadoop’s golden era and can now catch a glimpse of what the technology will look like when it finally reaches the level of ubiquity many think it will attain.