A Strata news digest

The Strata + Hadoop World Big Data confab starts up in San Jose, CA this morning, and a bunch of vendor announcements are coming along for the ride. Some of the news broke just this morning, while a number of other announcements were made yesterday. Several others were made last week that I wasn’t able to cover. This admittedly long post aims to bring you up to speed on all of this news, in a categorized format. Here goes:

Pivotal
Pivotal, the data and cloud spinoff from EMC and VMware, had so many announcements that I’m giving it a category of its own. The short version: Pivotal is getting the religion of open source, industry consortia participation and partnering.

On the open source front, Pivotal has decided that its HAWQ SQL-on-Hadoop engine and its GemFire in-memory NoSQL database, which have until this point been proprietary and compatible only with the Pivotal HD Hadoop distribution, should be made open source and compatible with other distributions. HAWQ is an implementation of the company’s Greenplum MPP data warehouse engine, re-engineered to run on the Hadoop Distributed File System (HDFS). Perhaps it’s for that reason that Pivotal decided that Greenplum should be made open source as well. That may be the most impactful announcement of the bunch; look for it to put other MPP vendors, including HP, IBM, Teradata and Microsoft, under some pressure to respond.

Interestingly, the open sourcing of Pivotal’s various Hadoop components has led the company to simplify licensing around the commercial versions of those same products. Enter the Pivotal Big Data Suite, which bundles those components under a single subscription that also includes entitlements to run it all on Pivotal’s Cloud Foundry technology. This move may help simplify customers’ views of Pivotal’s numerous offerings and increase cross-sell opportunities between the data-related products and the cloud-related ones. And by the way, the Big Data Suite versions of HAWQ and Greenplum will feature updated query optimizer technology called Orca, as well as management tools, neither of which will be included in the open source versions.

In the lead-up to the announcements from Pivotal, some in the industry were wondering if the company would retire its branded Hadoop distribution. It seems such forecasts were off-base, at least for now. But the company is moving to make Pivotal HD better aligned with other distros. The company has joined forces with Hortonworks, IBM, SAS, Teradata, Splunk, Altiscale, GE, Verizon and others to establish the Open Data Platform (ODP), a baseline specification (the “ODP Core”) for standard Hadoop distributions. By defining a cross-vendor baseline configuration that other vendors can write to, ODP aims to reduce the amount of fragmentation in the Hadoop ecosystem. Other vendors will be joining, but the absence of Cloudera and MapR is conspicuous.

In addition to its ODP membership, Pivotal announced a direct, bilateral partnership with Hortonworks, which entails joint engineering efforts, integration of Greenplum and Hortonworks Data Platform, and Hortonworks support services for Pivotal customers.

New releases
Pivotal isn’t the only company with new releases. Here are some more:

  • Last week, Datameer announced a new Professional product, which comes in cheaper than its Enterprise sibling, by way of a cloud subscription.  This cloud release runs on the Altiscale Hadoop as a service cloud in North America and the Bigstep cloud (using Cloudera’s CDH Hadoop distro) in Europe.
  • Speaking of cloud, Logi Analytics has announced that its Logi Vision package is now available on an hourly billable basis on Amazon Web Services’ marketplace (something Jaspersoft has been offering dating back to the days before TIBCO acquired it).
  • MapR has announced a new release of its Hadoop distro, including an enhanced version of its HBase-compatible MapR-DB. This version includes cross-data center replication (XDCR), allowing geo-distributed operational applications to run on the MapR platform.
  • GridGain announced the 7.0 release of its in-memory data fabric (IMDF).
  • Paxata has announced the Spring ’15 release of its data prep platform, now using Apache Spark 1.2 (and a number of the company’s own enhancements to it) as the underlying data processing engine.

Predictive analytics
In the realm of predictive analytics, one established player in the space has a new release and a player from the adjacent data warehousing arena has a new entrant in the market:

  • RapidMiner has a new release that embraces Spark as well. Specifically, RapidMiner’s Radoop component, which facilitates the scoring of RapidMiner models in Hadoop, now offers integration of algorithms from Spark’s MLlib (machine learning library), along with integration of Kerberos for data authentication on ingest. The new release also improves on the product’s Wisdom of Crowds functionality: now, in addition to crowd-sourcing advice on algorithm selection and model design, the product can lean on other users’ expertise for setting algorithm parameter values for a particular model as well.
  • HP is announcing its new Haven Predictive Analytics product. Much as Radoop allows RapidMiner to build and score models in-place within Hadoop, HP’s Distributed R product allows a similar distributed, in-place implementation within the HP Vertica data warehouse.

Apache Software Foundation projects and news
We’ve already discussed Pivotal’s excitement around open source and ODP, but back in the land of Apache Software Foundation open source projects, there’s notable news as well:

  • MapR has kicked off a new open source project called Myriad that aims to integrate the Apache Mesos data center management layer with the YARN cluster-management component of Apache Hadoop. Myriad will be a MapR-led project for now, but the company has made explicit its aspiration to move the project toward Apache Incubator status soon.
  • And speaking of the Apache Incubator, one project there, Ignite, now has a version 1.0 release candidate ready for download.  Ignite is based on the open sourcing of the GridGain IMDF product. The 7.0 release of that product (discussed above) and Ignite 1.0 are feature-aligned.
  • In other ASF news, Cloudera has renewed its sponsorship of the body and upgraded it to Platinum level.

Acquisition news
I’ve talked a great deal in my posts about the consolidation in the Big Data space. Microsoft recently announced its intention to acquire Revolution Analytics; Teradata has acquired Revelytix, RainStor, Hadapt and ThinkBig Analytics; Cloudera acquired Gazzang; Hortonworks acquired XA Secure; and TIBCO acquired Jaspersoft.

Now Pentaho, which once sat alongside Jaspersoft as a commercial open source software BI provider, is slated for acquisition as well, specifically by Hitachi Data Systems (HDS). Such an acquisition would mark a transformation of HDS from pure systems integrator (SI) to a company that offers both services and products. Granted, HDS was already developing IP that it could implement across its various projects, but the acquisition of Pentaho will accelerate these efforts immensely.

Security
The integration of Kerberos security into RapidMiner’s Radoop component is a good example of the analytics world growing ever more sensitive to Enterprise standards and requirements. Beyond such security-sensitive moods, there are three security-specific new product announcements:

  • Dataguise, which has offered powerful data masking technology for Hadoop and for relational databases like Oracle and SQL Server, has now announced the release of its dgSecure product for Apache Cassandra. The interesting thing about NoSQL databases like Cassandra is that their loosely- and un-schematized architectures make it difficult to secure data in the database catalog itself. Dataguise’s dynamic data masking (DDP) redacts certain data on the fly, based on pre-specified configuration or automated detection by dgSecure’s DDP agents.
  • Centrify has announced its Centrify Server Suite 2015, which integrates Microsoft Active Directory (and other LDAP servers) with Hadoop. Ostensibly, this would allow for a user’s enterprise single sign-on-based identity context to be honored by Hadoop, making sure the user’s cluster access is correctly governed and that her queries are appropriately logged as hers, for audit purposes. Whether this technology is eventually integrated with that of ASF projects like Sentry and Ranger is not yet clear to me.
  • Which is a good segue to the BlueTalon Policy Engine, announced today. BlueTalon’s product enables fine-grained access controls similar to those of Sentry and Ranger, data masking functionality similar to Dataguise’s and audit features that have at least some overlap with Centrify’s. BlueTalon is also announcing $5M in funding, which ought to help ensure that the Hadoop enterprise security space is competitive, even as it’s emerging.
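
For readers who want a concrete picture of the on-the-fly masking idea mentioned in the Dataguise item above, here is a minimal Python sketch. It is not dgSecure’s implementation or API; the field list, the pattern and the function names are all hypothetical, chosen only to illustrate the two triggers described: pre-specified configuration and automated detection.

```python
import re

# Hypothetical configuration: field names that are always redacted.
MASKED_FIELDS = {"ssn", "credit_card"}

# Hypothetical detector: anything shaped like a US Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_value(value):
    """Replace every character except the last four with '*'."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def mask_record(record):
    """Return a redacted copy of a record; the stored record is untouched."""
    masked = {}
    for field, value in record.items():
        if field in MASKED_FIELDS:
            # Trigger 1: the field is listed in the configuration.
            masked[field] = mask_value(value)
        elif isinstance(value, str):
            # Trigger 2: automated detection inside free-text values.
            masked[field] = SSN_PATTERN.sub(lambda m: mask_value(m.group()), value)
        else:
            masked[field] = value
    return masked


row = {"name": "Ada", "ssn": "123-45-6789", "note": "SSN on file: 987-65-4321", "age": 36}
masked_row = mask_record(row)
print(masked_row)
```

The point of the sketch is that redaction happens at read time, on a copy, so the data at rest is unchanged and different users could in principle be shown different views.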

SiSense and Simba
There’s one last bit of news I wanted to cover that may at first seem minor: a partnership between SiSense (which offers a powerful self-service data discovery platform of the same name) and Simba (whose ODBC and JDBC connectors to Big Data and NoSQL platforms are the de facto standard). Through this partnership, SiSense now offers supported direct connectivity to data in MongoDB NoSQL databases.

Simba’s drivers essentially provide SQL access to NoSQL data. Tools like SiSense know how to issue SQL queries without requiring that their users have such skills. The combination allows code-free access to MongoDB, a database which typically is queried using imperative code written in JavaScript (something arguably more difficult than declarative queries written in SQL).

Of course, providing such access to MongoDB requires data in its collections to adhere to a structured rows-and-columns layout. While that may seem the antithesis of NoSQL, the reality is it’s a requirement for data discovery. Thus the worlds of relational SQL databases and loosely-schematized NoSQL databases are drawing closer. And that’s why this seemingly minor news gets three closing paragraphs of coverage.
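
To illustrate the rows-and-columns point, here is a small Python sketch that flattens a few nested, loosely-schematized documents into a table and then queries it with SQL. This is only an illustration under stated assumptions, not a description of how Simba’s drivers work: the sample documents, the `flatten` helper and the `people` table are hypothetical, and SQLite’s in-memory engine stands in for whatever SQL layer a real driver exposes.

```python
import sqlite3

# Hypothetical documents, shaped the way a document store might return them.
documents = [
    {"_id": 1, "name": "Ada", "address": {"city": "London", "country": "UK"}},
    {"_id": 2, "name": "Grace", "address": {"city": "New York"}},
    {"_id": 3, "name": "Linus"},
]


def flatten(doc, prefix=""):
    """Turn nested keys into column names, e.g. address.city -> address_city."""
    row = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}_"))
        else:
            row[column] = value
    return row


rows = [flatten(doc) for doc in documents]

# The table's columns are the union of all keys seen; missing values become NULL.
columns = sorted({column for row in rows for column in row})

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE people ({', '.join(columns)})")
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    f"INSERT INTO people VALUES ({placeholders})",
    [tuple(row.get(column) for column in columns) for row in rows],
)

# A declarative query of the kind a BI tool could generate on the user's behalf.
result = conn.execute(
    "SELECT name, address_city FROM people "
    "WHERE address_city IS NOT NULL ORDER BY name"
).fetchall()
print(result)
```

Documents that lack a field simply yield NULL in that column, which is the price of imposing a fixed schema on data that never promised one.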