Microsoft announces numerous Azure data releases; HBase finally turns 1.0

After last week’s somewhat encyclopedic (and I mean that self-effacingly) Strata news roll-up, this week’s post focuses on just a couple of items: Microsoft’s collection of cloud data announcements (which also broke at Strata, but did so after my deadline last week) and the version 1.0 release of Apache HBase.

The Microsoft news was multi-faceted. Here’s a summary:

Machine learning and Hadoop
The Azure Machine Learning GA is a big deal for Microsoft. It’s also a big bet, and one that the company doubled down on when it announced the acquisition of Revolution Analytics. If that doesn’t tell you that Microsoft is serious about the R programming language and the people who use it, then nothing will. And Microsoft’s support for Python in Azure ML makes the cross-platform thrust especially concrete.

On the HDInsight side, the GAs are less exciting, but they’re also super-important. Keeping up with the latest Apache Hadoop bits is imperative if Microsoft wants the Hadoop community to take HDInsight seriously. Bringing Apache Storm support to GA shows Microsoft is serious about streaming data applications.

Cluster scaling (whereby nodes can be added to, or removed from, an HDInsight cluster while it’s running) and node size selection give customers the kind of control they’d get building their own clusters on Azure’s Infrastructure as a Service (IaaS) platform while allowing them to stay on the much more straightforward Platform as a Service (PaaS). It’s part of what I believe to be Microsoft’s overall strategy of putting IaaS and PaaS together along a spectrum, thus transcending the dichotomy between them.

SQL DB v12: smells like SQL Server 2014
Switching over to relational databases for a moment, the GA of Azure SQL Database v12 (which was actually announced earlier in the month) extends that dichotomy transcendence to the on- and off-premises divide. Why? Because v12 is the first version of SQL DB that delivers substantial feature compatibility with Microsoft’s SQL Server on-premises database product. In fact, SQL Server 2014 is version 12 of SQL Server (run the query SELECT @@VERSION if you don’t believe me).
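
If you'd rather check that programmatically than eyeball it, here is a minimal sketch in Python that pulls the major version number out of the text SELECT @@VERSION returns. The function name and the sample banner are my own illustrations, not anything from Microsoft's docs; the banner is only shaped like what a SQL Server 2014 instance reports, and the exact build number on your server will differ.

```python
import re


def major_version(banner: str) -> int:
    """Return the major version from a SELECT @@VERSION banner string.

    Hypothetical helper: it looks for the first dotted build number
    (for example 12.0.2000.8) and returns the leading component.
    """
    match = re.search(r"\b(\d+)\.\d+\.\d+(?:\.\d+)?\b", banner)
    if match is None:
        raise ValueError("no dotted version number found in banner")
    return int(match.group(1))


# Illustrative banner only (an assumption); real output has more detail
# and a build number that depends on the patch level.
SAMPLE_BANNER = "Microsoft SQL Server 2014 - 12.0.2000.8 (X64)"

if __name__ == "__main__":
    print(major_version(SAMPLE_BANNER))
```

In practice you would feed the function whatever string your database driver hands back from the query, rather than a hard-coded sample.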

Microsoft hearts Linux, and so does HDInsight
Now back to Hadoop: what the devil is Microsoft doing bringing out a Linux-based version of its own PaaS Hadoop service? Sure, buy Revolution Analytics, the key commercial entity behind R. Cater to the Python community before your own F# faithful. It’s an earnest gesture. But if someone needs Linux-based Hadoop, why not point them to the Azure VM images running Cloudera’s CDH distribution or Hortonworks’ HDP?

The answer is twofold: first, to be relevant to the Hadoop community and compatible with many of its tools and techniques, Linux is a must; and, second, once again, Microsoft is seeking to make IaaS and PaaS complementary choices along a spectrum, rather than an us-versus-them proposition.

Beyond the strategy, there is some very concrete usability here, which I can relate first-hand. I was able to spin up a Linux-based HDInsight cluster myself this weekend (I did this while the rest of the world was watching the Oscars, but I digress). Provisioning the cluster was easy, as I expected, but I was unsure what the experience would be like afterwards.

Feeling right at home
As it happens, all of the experience I had using Amazon’s Elastic MapReduce (EMR — which, of course, is also Linux-based) carried over directly. In very little time, I connected to my cluster using PuTTY, the de facto standard Windows SSH client, and did some quick work in Hive and Pig. Since I set the cluster up as a standard Hadoop cluster, HBase was not installed. But since HDInsight for Linux automatically includes Apache Ambari, I was able to use its browser-based interface to add HBase in a few clicks.

I’d never even used Ambari before, but since setup of the cluster was so straightforward and since there was a button in the Azure portal to launch Ambari, I was able to figure it out, as a relatively incremental self-training task. That’s exactly what a good cloud offering should do. Now I can take that knowledge back to an on-prem cluster and apply it there. Again, a divide is bridged.

HBase officially not in Kindergarten
And speaking of HBase, that Apache project officially pushed out its 1.0 release yesterday. In the world of Apache projects, a 1.0 release is far from the first, but it is a watershed: it typically marks the point at which the project management committee (PMC) considers the software solid, stable and ready for mainstream adoption.

That said, HBase has already seen mainstream adoption…a lot of it, in fact, so this 1.0 release may be a non-event. But my take is it marks a coming of age for NoSQL, and of an approach that makes NoSQL into a service that other databases (including even relational ones, like Splice Machine) can use.