Weekly Update

The Strata news begins

The Strata + Hadoop World confab kicks off in New York today (with major sessions and keynotes starting tomorrow) and major Big Data news is already forthcoming. Below is a summary of what’s been made public thus far.

Cloudera adds Impala 2.0, streaming and new Director component
Cloudera yesterday announced version 5.2 of its Cloudera Distribution including Apache Hadoop (CDH), which includes a new version (2.0) of its Impala SQL-on-Hadoop engine and new real-time/streaming capabilities based on Apache Flume, Apache Kafka, the Spark Streaming component of Apache Spark, as well as Impala and Apache HBase. Also in the box: a completely new component called Cloudera Director, which provides a user interface and management console for automated deployment of CDH-based Hadoop clusters to public cloud platforms. Version 1 of Director will support Amazon Web Services’ Elastic Compute Cloud (EC2) Infrastructure as a Service (IaaS) platform. The Director roadmap calls for the addition of support for IBM SoftLayer, CenturyLink and T-Systems cloud platforms. Support for Google Cloud Platform, Rackspace and GoGrid is in early stages of development, Cloudera told me.

Impala 2.0 adds support for most ANSI SQL commands, as well as some vendor-specific extensions. SQL GRANT and REVOKE permissions have also been added to Impala, enabled by integration with Apache Sentry. Speaking of Sentry, the project has been enhanced to feature a plug-in architecture to enable it to work across the Hadoop stack, rather than just with Hive and Impala. One such plug-in — designed to work with the Apache Solr-based Cloudera Search — has been developed thus far.

Hortonworks goes gangbusters with YARN, further enhances Hive
Hortonworks is announcing today the release of its Hortonworks Data Platform (HDP) 2.2, which includes Apache Spark running on YARN in a form that is now deemed “Enterprise-Ready” by Hortonworks, rather than “preview.” YARN-integrated versions of Apache HBase, Apache Accumulo and Apache Storm are included, as are Apache Kafka, Apache Solr and Cascading. Also added is the ability to perform data replication and archiving on Microsoft Azure Storage or Amazon Web Services Simple Storage Service (S3). Speaking of Azure, Hortonworks will now support running HDP for Windows on Azure virtual machines (the Azure IaaS offering), no longer limiting customers to using Microsoft’s HDP-based HDInsight cloud Hadoop service. Hortonworks is also announcing that Windows and Linux installs of HDP will now ship simultaneously.

And just as Cloudera has enhanced Impala, Hortonworks is continuing its “Stinger” initiative to enhance Apache Hive. Phase 1 deliverables of this second-gen Stinger initiative are shipping with HDP 2.2, and include the ability to perform SQL INSERTs, UPDATEs and DELETEs in Hive. In other words, Stinger is transforming Hive from a read-only SQL layer to a read-write engine, and one that supports ACID transactions, at that. Phase 2 will add sub-second query performance and integration with Spark’s machine learning capabilities; phase 3 will add support for SQL 2011 analytics, materialized views, queries that federate cluster nodes across geographical boundaries, and more.

And just as Cloudera is pushing Sentry-based security capabilities, Hortonworks is promoting project Argus, the open-sourced Hadoop security extensions from XA Secure, the company Hortonworks acquired earlier this year. Hortonworks has proposed Argus to the Apache Software Foundation as an incubator project. The team has already developed integration with Apache Knox, Storm, Hive and HBase. It has also developed key management and transparent encryption for HDFS.

Hortonworks is also adding views and blueprints for Apache Ambari and the ability to perform rolling upgrades (with no downtime) on HDP clusters.

MapR-DB for all my friends
MapR, meanwhile, is announcing today that its HBase-compatible NoSQL engine, MapR-DB, will now be available in the Community Edition of MapR’s Hadoop distribution. This database, which is optimized for MapR’s read-write implementation of Hadoop’s Distributed File System (HDFS) and written close to the metal in C++, has driven some important momentum for MapR, and the company seems to be returning the favor by making the product available to all. MapR-DB is not open source, but it is nonetheless available in the Community Edition of MapR’s distro.

Partnerships galore
Partnerships are cropping up everywhere at this Strata show. EMC is announcing a partnership with Cloudera that will enable CDH to use EMC Isilon as its storage layer, rather than having to use a direct-attached storage drive-based HDFS implementation. Cloudera also announced a partnership with Teradata which looks very similar to the one Hortonworks has with the data warehousing stalwart. Cloudera has also announced it will be the exclusive provider of Hadoop services on the CenturyLink cloud platform. Furthermore, the partnership with Dell and Intel, announced previously, has resulted in products which are now in general availability.

Tableau is partnering with a few companies of its own, and is doing so through concrete connectors to their products. These connectors enable integration of Tableau with Amazon’s Elastic MapReduce (EMR) cloud Hadoop service, and the combination of IBM’s InfoSphere BigInsights Hadoop distro along with Big SQL, its SQL-on-Hadoop technology. Since Tableau already has connectivity to CDH, HDP, MapR and PivotalHD, the addition of BigInsights gives it connectivity to every major Hadoop distribution on the market. Tableau’s connector to EMR is a beta release, as is that of one other connector, which we’ll discuss in a moment.

Predixion Software (whose new release is discussed below) announced its own partnership with Salesforce. This partnership results in integration between Predixion’s software and the Salesforce Analytics Cloud, specifically geared towards helping health care organizations leverage predictive analytics for population health management.

R we there yet
Revolution Analytics is the company that offers a scale-out/clustered version of the R statistical programming language (and a number of accompanying packages and components), under the name Revolution R Enterprise (RRE). Today the company is announcing an open source version of RRE, called Revolution R Open. Those wishing legal indemnification and a support subscription for R Open will find it available under the name Revolution R Plus.

The embrace of open source R by Revolution means that a multi-threaded, highly parallelized version of the language (based on Intel MKL) is now available to all. Revolution claims that Revolution R Open is up to 20x faster than vanilla open source R. It also includes these validated and tested components: Reproducible R Toolkit, ParallelR, RHadoop and DeployR Open.

Predixion (4.0) comes true
Predixion Software, which has for some time offered a user-friendly predictive analytics package that is hosted in Microsoft Excel, is announcing version 4 of the product, Predixion Insight. This release adds a browser-based thin client, and further enhances the ability to leverage predictive models from Microsoft’s Analysis Services Data Mining engine, Apache Mahout and the R programming language. Predixion Insight also integrates with PMML (the Predictive Model Markup Language) and continues to enhance its own Machine Learning Semantic Model (MLSM) to enable the deployment of predictive models to numerous platforms.

Sparks fly
The mainstreaming of Apache Spark shows no sign of letting up. Tableau is announcing the beta release of its Spark SQL connector; MapR has announced an initiative to integrate Apache Drill with Spark (providing another SQL-on-Spark option, but one that does not require prior declaration of table schemas); and BlueData announced yesterday a Databricks-certified Spark distro for its EPIC private Big Data cloud platform.

And some other new release news
As an epilog, we should point out that self-service Big Data transformation vendor Trifacta announced on Thursday the 2.0 release of its self-named product. This version includes support for ingest of data in Avro, ORC and Parquet formats, along with integration for Cloudera’s Impala and, yes, Apache Spark. Trifacta 2.0 also adds a powerful Visual Data Profiling feature which the company demoed for me last week.

And one more piece of news from last week is germane, especially since its impact hits this week. Predictive analytics vendor RapidMiner announced that its RapidMiner Cloud service would launch tomorrow, October 15th. The cloud service enables predictive analytics on more than 300 cloud platforms including Amazon, Salesforce.com, Twitter, Dropbox and Zapier, according to the company’s press release.

How about some analysis?
My weekly updates are supposed to have a word limit, and I’m already well beyond it, so I’ll limit this post to an enumeration of news items. I suspect there will be more such announcements during the remainder of this week, and I myself am scheduled for no fewer than 18 vendor briefings at Strata. Suffice it to say, I’ll be back next week detailing more announcements, and presenting an array of thoughts across all the news from this week.