The recent excitement around Hadoop has culminated in five new Hadoop products today from EMC, NetApp, Mellanox, SnapLogic and DataStax. That large technology vendors with hardware expertise are pushing gear optimized for Hadoop highlights the potential riches to be made in big data analytics, and also suggests that mainstream Hadoop adoption is on its way.
New Hadoop Hardware
Hadoop development thus far has focused largely on improving the open-source, data-processing software tools that come from the Apache Hadoop project and are designed to run on clusters of commodity servers. With EMC and NetApp now entering the market, the big vendors are starting to offer specialized Hadoop gear.
NetApp is getting into the Hadoop game with a new lineup of storage products dedicated to serving high-bandwidth, high-performance applications, such as Hadoop workloads. The new E-Series lineup comes from technology obtained in NetApp’s recent Engenio acquisition, and the Hadoop-focused version is the E2600. Part of Hadoop’s appeal is that it can host storage and processing on the same commodity servers, but NetApp’s product takes a more traditional — and higher-margin — approach of separating the storage and processing tiers. However, the E2600 does offer a SAS interface for attaching directly to computing servers, as well as iSCSI and Fibre Channel interfaces for SAN environments. For organizations requiring high storage reliability and scalability distinct from the computing tier, this could be a good approach.
High-performance networking pioneer Mellanox is looking to increase throughput in Hadoop clusters via its ConnectX-2 adapters with Hadoop Direct. I’ve highlighted before how big data applications such as Hadoop will require more network bandwidth to achieve higher performance, and Mellanox claims that total job runtime was cut in half when running its ConnectX-2 and Hadoop Direct setup on InfiniBand or 10 Gigabit Ethernet, compared with regular Ethernet.
Later this morning, EMC is releasing its first-ever Hadoop-based product, which will fall under the company’s big data division, anchored by the Greenplum analytic database. The Dow Jones Newswire reported last week that EMC will package Hadoop into its Greenplum analytic appliances and may release its own Hadoop distribution. I’ll be at EMC World later and will have more on EMC’s potentially game-changing plans once they’re announced.
On the Software Side
DataStax Brisk, which was announced during our Structure: Big Data conference in March, is now available as a beta release. As I reported at the time, Brisk is a new Hadoop distribution that replaces the standard Hadoop Distributed File System (HDFS) storage component with the open source Cassandra NoSQL data store. Whereas Apache Hadoop is designed for batch processing, DataStax claims Cassandra makes it easy to store, process and serve application data in near real time. Additionally, says DataStax VP of Product Management Ben Werther, Brisk uses the Cassandra API, which removes much of the complexity of setting up Hadoop environments with HDFS.
Finally, SnapLogic is looking to improve processing of SaaS application data with a new product called SnapReduce. According to the press release, “SnapReduce transforms SnapLogic data integration pipelines directly into MapReduce tasks, making Hadoop processing much more accessible and resulting in optimal Hadoop cluster utilization.” It uses HDFS as the repository for all data: data streams from a company’s SaaS application into HDFS, where it is then processed using Hadoop MapReduce. Data integration has been a hot area for Hadoop lately, as both Syncsort and Pervasive Software have been working on products to ease the process of creating MapReduce workflows.
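For readers unfamiliar with the MapReduce model that these integration tools target, here is a minimal in-memory sketch in plain Python. It is not SnapLogic’s API and does not use Hadoop. The record fields, function names and sample data are all hypothetical, chosen only to show how a simple "sum per account" pipeline step splits into a map phase, a grouping (shuffle) phase and a reduce phase.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence on an in-memory data set."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical records, standing in for data exported from a SaaS application.
records = [
    {"account": "acme", "amount": 120.0},
    {"account": "globex", "amount": 75.5},
    {"account": "acme", "amount": 30.0},
]


def revenue_mapper(record):
    # Emit one (account, amount) pair per record.
    yield record["account"], record["amount"]


def sum_reducer(key, values):
    # Total all amounts seen for one account.
    return sum(values)


totals = run_job(records, revenue_mapper, sum_reducer)
print(totals)
```

In a real cluster the map and reduce functions run in parallel across many machines and the data lives in HDFS rather than in a Python list, which is the plumbing that products in this category aim to hide.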
I noted in late March that the “Hadoop war” was underway, and it’s spreading further with every passing week. Vendors are putting a lot of money into a market where relatively little money is changing hands right now, which suggests their customers are telling them, “If you build it, we’ll come.”