Weekly Update

Apache Big Data releases continue unabated

Last week, I wrote briefly about Apache HBase pushing out its 1.0 release. Since then, news of several more releases of Big Data-related projects has rolled out of the Apache Software Foundation. I think it’s important to take stock of these releases, less to learn the details of each than to discern some interesting trends in the greater Hadoop ecosystem.

The new releases include the twin 5.0.0 releases of search project Apache Lucene and its Solr sub-project; the 2.3 release of the Apache Parquet column store file format (which is still an incubator project); a rather asymptotically numbered 1.99.5 release of Apache Sqoop, which offers an abstraction layer around MapReduce for moving data between HDFS and various data warehouse platforms; and the 0.9.4 release of an interesting Apache incubator project called MRQL.

Solr power
Lucene and Solr have been around for a long time, and Lucene has an important relationship with Hadoop: Cloudera Chief Architect Doug Cutting is the creator of both. Beyond that, Cloudera, Hortonworks and MapR each include a search interface for Hadoop that is built upon Solr and Lucene and can index data stored in HDFS, Hadoop’s file system.

The 5.0.0 releases have a number of performance and architecture enhancements that together seem to constitute a thorough house cleaning and modernization of the platform. As Lucene becomes embedded in ever more products, projects and engines, this is good to see.

Smooth…creamy…butter!
Speaking of file systems, and of Cloudera, the Apache Parquet file format, which allows for storage of column store data in simple files, is growing in importance. Announced almost two years ago, when it was a joint project between Cloudera and Twitter, the Parquet format became an Apache Incubator project this past May. Parquet is designed specifically for Hadoop, and effectively serves as Impala’s native file format. But other Apache projects are compatible with it, including Hive, Pig and Drill.

As Parquet has Cloudera’s imprint on it, we probably shouldn’t be too surprised that another column store file format, called ORC (Optimized Row Columnar), is out there as well, and has been largely driven by Hortonworks. ORC is essentially Hive’s native file format, succeeding the RCFile (Record Columnar file) format in that role. There are a lot of commonalities between Parquet and ORC, and convergence of the two, though unlikely, wouldn’t be the worst thing for the ecosystem.
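
For readers who haven’t worked with column stores, here is a minimal Python sketch of the general idea behind formats like Parquet and ORC. It is not either format’s actual on-disk layout or API; the sample records, field names and helper functions are all hypothetical, chosen only for illustration. It pivots row-oriented records into one list per field, so that a query touching one field reads only that field’s values, and shows why runs of repeated values in a column lend themselves to compact encoding.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"user": "ann", "country": "US", "clicks": 12},
    {"user": "bob", "country": "US", "clicks": 7},
    {"user": "cho", "country": "DE", "clicks": 30},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (column-oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one field only needs that field's list,
# not the "user" or "country" values stored alongside it.
total_clicks = sum(columns["clicks"])

# Values of the same field sit next to each other, so repeats compress well.
encoded_country = run_length_encode(columns["country"])
```

A row-oriented file would have to read every record in full to compute the same total, which is the main reason analytic engines such as Impala and Hive favor columnar files.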

Two Sqoops of data
Apache Sqoop (a contraction of SQL-to-Hadoop) is an interesting animal. At a time when Hadoop is moving away from MapReduce and data movement is increasingly considered something to be avoided, Sqoop continues as a MapReduce-based data movement tool. At version 1.99.5, it would seem that we’re almost ready to welcome the era of “Sqoop 2.0.” But the Sqoop2 project, which adds a UI and greater manageability to Sqoop, was kicked off over three years ago, and the Sqoop Web site still says that Sqoop 1.99.5 isn’t feature-complete or intended for production deployment. So, as with the execution of bulk data transfers, it looks like we’ll need to hurry up and wait.

I believe in MRQL
The final Apache Big Data project to usher in a new release is MRQL (pronounced “miracle”), which provides a SQL abstraction layer over several different distributed computing platforms. These include Hadoop MapReduce and Apache Spark, with which many are familiar, as well as Apache Hama and Apache Flink, with which I dare say many are not.

I will admit quite plainly that I am not learned in the ways of MRQL, but I find some things interesting about the engines on which it can run:

  • MapReduce 2.0 and Flink run as YARN applications on Hadoop
  • Spark and Hama can both run as YARN applications as well, but can also run on Apache Mesos

The take-away? YARN is at the center of a lot of things. But so is Spark. Spark doesn’t require YARN or Hadoop, and neither do Hama, MRQL or anything that can run on either of them. Could YARN ironically facilitate a loss of momentum for Hadoop, as it brings cluster-type-agnostic engines into the Hadoop circle, which then woo users out of that circle through their own allure and self-sufficiency?

Big Data: structure
After attending the Strata + Hadoop World conference in San Jose a couple of weeks ago, my Gigaom News colleague, Derrick Harris, opined that, for now, Spark looks like the future of Big Data. I’m less convinced, but as we head into Gigaom’s own Structure: Data conference in New York the week after next, I’ll be giving this a lot of thought, especially while I have the pleasure of interviewing Spark co-creator and Databricks CEO Ion Stoica on stage at the event.