Weekly Update

Splice Machine goes GA, and other short stories

Yes, news broke early last week of one big data company’s pending IPO, but in the week since then, new builds, a General Availability release, and new products have been announced. Let’s take an inventory:

Splice Machine goes GA
Splice Machine, a fully relational database that is implemented atop Apache Derby, Apache Hadoop, and Apache HBase, hits version 1.0, its general availability release, today. This GA version ships with streamlined Hadoop MapReduce integration and LDAP integration (which presumably would include Active Directory), among other features.

Splice Machine 1.0 also adds SQL analytic windowing functions, an important set of SQL language constructs that are implemented in conventional relational databases as well. Leonard Lobel, with whom I co-authored a book about Microsoft SQL Server, has an excellent two-part blog post on how windowing functions work.
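For readers who haven’t used windowing functions, here is a minimal sketch of what they do. It runs against SQLite through Python’s built-in sqlite3 module as a stand-in, not against Splice Machine, and it assumes a SQLite build of 3.25 or later (the first with window function support). The sales table, its columns and the running_totals helper are all made up for illustration; I have not verified that Splice Machine accepts this exact syntax.

```python
import sqlite3


def running_totals(rows):
    """Return each sale with a per-region running total and amount rank.

    `rows` is a list of (region, day, amount) tuples. The table and
    column names are hypothetical, chosen only for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

    # Unlike GROUP BY, a window function keeps every input row and adds
    # a value computed over that row's "window" (here, its region).
    query = """
        SELECT region, day, amount,
               SUM(amount) OVER (PARTITION BY region ORDER BY day)
                   AS running_total,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC)
                   AS amount_rank
        FROM sales
        ORDER BY region, day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


SAMPLE = [
    ("east", 1, 10),
    ("east", 2, 30),
    ("east", 3, 20),
    ("west", 1, 5),
    ("west", 2, 15),
]
```

Calling running_totals(SAMPLE) returns five rows, one per sale, each carrying the cumulative total for its region up to that day and its rank by amount within the region. The point is that the detail rows survive; an ordinary aggregate query would collapse them into one row per region.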

Splice is pretty serious about offering a true ACID-compliant, SQL RDBMS that nonetheless offers Hadoop’s simple horizontal scaling model. And at least one Hadoop company is taking the effort seriously: Cloudera and Splice Machine have formed a partnership.

SQL-on-Hadoop: It’s all about Vertica integration
The folks at HP are serious about the combination of SQL and Hadoop too. The company announced on Monday the release of HP Vertica for SQL on Hadoop (let’s call it “VSH”), version 1.0. The product is Hadoop distribution-agnostic, and provides a SQL query layer over data stored in the Hadoop Distributed File System (HDFS).

I’m not clear whether VSH is a standalone engine over Hadoop, akin to Impala or Pivotal HAWQ, or if it’s Hadoop bridging technology for Vertica, like Microsoft’s PolyBase or Teradata’s QueryGrid. The former two let you query Hadoop data head-on; the latter two let Hadoop data appear as virtual tables inside an otherwise conventional data warehouse. I haven’t yet spoken directly with Vertica, but I’ll get to the bottom of this.

Speaking of SQL-on-Hadoop, open-source project Apache Drill, inspired by Google’s Dremel/BigQuery technology, released its version 0.6 on Friday. The project team characterizes this as the second Beta release of the project (0.5 being the first). Drill offers a SQL interface to data stored in Hadoop. While that may at first blush sound superfluous given what Hive, Impala, and other SQL-on-Hadoop projects deliver, Drill is a legitimately different beast.

Drill can query almost any file on demand, independent of any schema information being stored in the Hive metastore. Drill can also query hierarchical data, including JSON files, and offers a dot-separated syntax extension to SQL for doing so (much like BigQuery’s). In addition to generic JSON access, the Drill 0.6 release notes explain that the new version adds the ability to query data stored in MongoDB, which means it provides a SQL-on-Mongo solution, something noteworthy in itself.

CouchDB lets loose and gets clustered
Speaking of MongoDB, its document store companion, Apache CouchDB, is undergoing some renovations. The project sprouted derivatives in Couchbase and Cloudant (now part of IBM), which added scale-out clustering capabilities to the technology. On Monday at ApacheCon Europe, the CouchDB team announced that version 2.0 of the project is now available in a developer preview release, and adds clustering of its own.

That 2.0 release also adds a faster compactor and replicator, easier setup and a new browser-based admin interface, “Fauxton,” to accompany the existing “Futon” browser-based data entry UI. The team has also shared that compatibility with MongoDB’s query syntax is on its roadmap, beyond the general availability release of CouchDB 2.0, in early 2015.

And speaking of Cloudant, IBM announced that a new Cloudant-compatible version of dashDB, its cloud data warehouse offering, is now available.

And one more distro
Ten days ago, Teradata and Cloudera announced an integration and reselling partnership allowing Cloudera’s Hadoop distribution to be sold with, and run on, Teradata appliances, side-by-side with Teradata’s own data warehouse product. The partnership, which is very similar to one Teradata has with Hortonworks, has now also been replicated with MapR. That company announced the new partnership this morning.

Beyond Cloudera CDH, Hortonworks HDP and MapR, the other bigger Hadoop distributions, IBM InfoSphere BigInsights and Pivotal HD, are both from companies that offer MPP data warehouse platforms (Netezza and Greenplum) that compete with Teradata, so similar partnerships there seem unlikely. Meanwhile, Teradata now has partnerships with all three Hadoop pure-play companies. This lets Teradata interoperate with, ship, and support whichever of these three distros a customer may prefer to work with.

The goose is getting fat
I’m eager to see whether this is the last batch of new Big Data/NoSQL releases before the holidays, or if we will see another slate of ship announcements between turkey day and Christmas. Meanwhile, I may download Splice Machine 1.0, or CouchDB 2.0, and see what’s what.

UPDATED to correct Splice Machine feature list and partnership details.