Weekly Update

MapR to Hadoop market: Drill baby, Drill

MapR announced a new 4.0.1 release of its Hadoop distro this week, and Apache Drill, which shipped its version 0.5 release last Friday, is in the box. But that’s not all. The Apache Cassandra project pushed out its 2.1 release last Thursday, Aerospike introduced a startup special, a new data analytics company was born and IBM announced its Watson Analytics product. The summer slow period is definitely over.

This is not a Drill
First things first: MapR and Drill. While plenty of folks obsess over the Cloudera-Hortonworks “Haduopoly,” MapR continues to develop a very competitive distribution that focuses on operational applications, with a read-write file system, an HBase-compatible database optimized for that file system, and Apache Spark integration. To that smorgasbord, MapR is now adding Apache Drill.

Drill, an Apache incubator project partially inspired by Google’s Dremel technology (commercialized as Google BigQuery), provides a columnar, ANSI-compliant SQL-on-Hadoop solution that can query Hive tables, HBase tables and files in plain text, JSON and Apache Parquet formats.

While Drill may sound like it’s just another SQL-on-Hadoop engine (joining MapR’s four other offerings in that category: Hive, Impala, Spark SQL and integration with the HP Vertica data warehouse platform), it goes beyond that. Drill’s dialect of SQL includes extensions that allow for easy querying of the kind of hierarchical/nested data that is very common in NoSQL databases like HBase and MongoDB, and in plain JSON files as well.
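To make the idea of querying nested data concrete, here is a minimal Python sketch. It is not Drill code and does not reproduce Drill’s SQL syntax; the sample records, the field names and the get_path and select helpers are all hypothetical, chosen only to illustrate addressing nested fields by path rather than flattening everything into relational tables first.

```python
from typing import Any, Callable, Optional

# Hypothetical nested records, similar in shape to documents found in
# JSON files or a document-oriented NoSQL store.
RECORDS = [
    {"name": "alice", "address": {"city": "Boston", "zip": "02110"}},
    {"name": "bob", "address": {"city": "Austin"}},
]


def get_path(record: dict, path: str, default: Any = None) -> Any:
    """Follow a dotted path such as 'address.city' into nested dicts.

    Returns `default` when any step of the path is missing.
    """
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def select(records: list, paths: list,
           where: Optional[Callable[[dict], bool]] = None) -> list:
    """Project the given dotted paths from every record that passes `where`."""
    rows = []
    for record in records:
        if where is None or where(record):
            rows.append(tuple(get_path(record, p) for p in paths))
    return rows


# Roughly: "give me name and city for everyone whose city is Boston".
rows = select(
    RECORDS,
    ["name", "address.city"],
    where=lambda r: get_path(r, "address.city") == "Boston",
)
```

The point of the sketch is that a missing nested field (bob has no zip) simply yields a default value instead of breaking the query, which is the kind of tolerance schema-light data tends to need.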

Hadoop in spirit
That’s Drill’s raison d’etre: providing a unified query engine for disparate data types, including the ability to query files without first requiring them to be ingested into a specific database. Casual connectivity to heterogeneous data, and connecting to that data where it sits, is very much in sync with the Hadoop ethos.
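As a rough illustration of querying data where it sits, the Python sketch below runs one filter across a JSON source and a CSV source without loading either into a database first. It is a conceptual analogy only, not how Drill is implemented; the read_rows and query helpers, the user and clicks fields, and the in-memory stand-ins for files are assumptions made for the example.

```python
import csv
import io
import json


def read_rows(source, fmt):
    """Yield dict rows straight from an open text source, one at a time."""
    if fmt == "json":
        # Assumes one JSON object per line.
        for line in source:
            line = line.strip()
            if line:
                yield json.loads(line)
    elif fmt == "csv":
        # Assumes the first line is a header row.
        yield from csv.DictReader(source)
    else:
        raise ValueError(f"unsupported format: {fmt}")


def query(sources, min_clicks):
    """Return sorted user names with at least `min_clicks`, across all sources."""
    result = []
    for source, fmt in sources:
        for row in read_rows(source, fmt):
            # CSV values arrive as strings, JSON values as ints; normalize.
            if int(row["clicks"]) >= min_clicks:
                result.append(row["user"])
    return sorted(result)


# In-memory stand-ins for two files in different formats (hypothetical data).
json_file = io.StringIO('{"user": "alice", "clicks": 3}\n{"user": "bob", "clicks": 7}\n')
csv_file = io.StringIO("user,clicks\ncarol,5\ndave,1\n")

heavy_users = query([(json_file, "json"), (csv_file, "csv")], min_clicks=5)
```

A single query function sees both sources as the same stream of rows, which is the unification the paragraph above describes, in miniature.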

Drill may be late, it may seem superfluous, and may still be lacking important features, ease of use and seamless deployment. But Drill’s 0.5 release (a stable beta, essentially) is nonetheless a very important event for the Hadoop ecosystem. Whether or not Drill itself gains traction, its query paradigm is important and will likely influence the design and feature set of other SQL-on-Hadoop solutions.

Cassandra’s important dot release
Moving on, Cassandra 2.1 is out, and it brings some important features of its own. These include performance optimizations derived from a new implementation of counters, indexes on collections, user-defined types and full Windows operating system compatibility. It will take a while for DataStax Enterprise to on-board the Cassandra 2.1 codebase, but DataStax is working to cut that lag time significantly.

Aerospike woos startups
Drill and Cassandra are open source Apache Software Foundation projects. In-memory database Aerospike, though originally a commercial closed source product, was recently open sourced as well. That lowers the barrier to entry, especially for curious developers and cash-strapped startups. Thing is, to be truly useful to the latter group, the product needs to be accessible in its enterprise edition as well.

With that in mind, Aerospike has announced a new startup special program, providing Aerospike Enterprise Edition (with no limits on nodes, data velocity or data volume) to startups that have less than $20M in funding, and less than $2M in annual revenue. Given the competition from other open source database products, this seems a good move for Aerospike.

ML Spinoff
Security-focused data science firm Qbase has very powerful and interesting technology. Thus far, most of it has been tied up in applications aimed at specific verticals, including government, military and national security.

But if the technology on its own is so good, why keep it captive to certain industries? With that in mind, Qbase announced on Tuesday the creation of a new division, which it intends eventually to spin off, called Synthos Technologies. Synthos will launch general purpose products that leverage Qbase’s real-time big data machine learning and make it available to users across all industries.

Elementary Watson
And on the topic of general purpose products that use advanced machine learning technology, IBM announced, also on Tuesday, that it will democratize access to its Watson technology through a new hosted freemium service.

Called Watson Analytics, the service is aimed at business power users who want to leverage IBM’s hosted access to Watson, and do so through a self-service user interface for data discovery and guided predictive analytics. The service is not actually available yet, but interested parties can sign up to be notified when it is.

Competition good for data consumers
Drill, Cassandra, Aerospike, Synthos and Watson each offer their own ways to query your data that go well beyond what conventional relational databases (be they designed for transactional or data warehouse workloads) ever offered. Every single one of these products and companies is searching for new users, mostly from the same overall pool of potential customers.

This competition is yielding a huge array of options at an incredibly rapid pace of innovation. We’re really in a golden age of data and analytics. Next month’s Strata/Hadoop World event in New York will likely bring a slew of new announcements that drive that point home even more forcefully.