NetApp does network-attached Hadoop

Seeking to appease enterprise customers demanding more-reliable and efficient Hadoop clusters to power their big data efforts, NetApp (s ntap) has partnered with Cloudera to deliver a preconfigured Hadoop storage system. Called the NetApp Open Solution for Hadoop, the new product combines Cloudera’s Hadoop distribution and management software with a NetApp-built RAID architecture.

As we’ve explained here before, Hadoop is great for storing and processing large quantities of unstructured data, but it does involve a fair amount of operational complexity to keep it running smoothly. As Cloudera’s head of business development Ed Albanese explained to me, because Hadoop architectures generally involve the computing and storage layers residing on the same commodity servers, it can be tedious and difficult to pull and replace a server if a disk goes down. And although that architecture does result in relatively low costs and high performance, it also can be power-hungry, because users are left scaling both layers when they might only need to scale one of the two.
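To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (workload size, terabytes per node, cores per node, the "shelf" unit) is a made-up assumption for illustration, not a figure from NetApp or Cloudera, and it ignores HDFS replication entirely.

```python
import math

def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Servers needed when each one carries both disks and CPUs.

    The larger of the two requirements decides the node count.
    """
    return max(math.ceil(storage_tb / tb_per_node),
               math.ceil(cores / cores_per_node))

def decoupled_units(storage_tb, cores, tb_per_shelf, cores_per_node):
    """(compute nodes, storage shelves) when each layer scales on its own."""
    return (math.ceil(cores / cores_per_node),
            math.ceil(storage_tb / tb_per_shelf))

# Hypothetical storage-heavy, compute-light workload.
storage_tb, cores = 1000, 400

nodes = coupled_nodes(storage_tb, cores, tb_per_node=10, cores_per_node=8)
idle_cores = nodes * 8 - cores  # CPUs bought only to get the disks

compute_nodes, shelves = decoupled_units(
    storage_tb, cores, tb_per_shelf=10, cores_per_node=8)

print(nodes, idle_cores, compute_nodes, shelves)
```

Under these assumed numbers, the coupled cluster is sized by its storage needs and ends up with cores it does not use, while the decoupled layout buys only as many compute nodes as the workload calls for. Real sizing would also have to account for replication, network bandwidth and hardware cost.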

And clusters have been growing, presumably to keep pace with fast-growing data volumes. In July, Cloudera’s Omer Trajman noted that customers’ average cluster size had grown to more than 200 nodes, and that 22 customers were managing at least a petabyte of data in their clusters.

The ultimate goal of the new NetApp product, Albanese said, is threefold: 1) to separate the compute and storage layers of Hadoop so each can scale independently; 2) to fit with next-generation data center models around efficiency and space savings; and 3) to improve reliability by being able to hot-swap failed drives and otherwise leverage NetApp’s storage expertise. Cloudera will actually start shipping a version of its Cloudera Enterprise management software designed specifically for this system. That will make it easier for customers to monitor storage performance and know when to add or replace drives.

Jeff O’Neal, NetApp’s senior director of data center solutions, added that his company’s foray into Hadoop will maintain performance levels even though data is now traversing the network to get to the compute nodes from the storage system. The data and compute loads are still logically connected, he said, and the storage layer maintains the Hadoop Distributed File System’s native shared-nothing architecture.

The product comes at the right time, as Apache-centric versions of Hadoop have come under fire from newcomers such as MapR and its partner EMC (s emc), which claim they can deliver better performance, reliability and availability than HDFS can. For enterprise customers that prefer the open-source nature of Cloudera’s Hadoop distribution but are rightfully concerned about HDFS reliability because they’re running production Hadoop workloads, the NetApp solution likely will provide a welcome point of comparison.

Interestingly, NetApp first mentioned its NetApp Open Solution for Hadoop when Yahoo (s yhoo) spinoff — and Cloudera competitor — Hortonworks launched in June. NetApp signed on as a Hortonworks partner early, claiming the new system would support the Hortonworks Hadoop distribution. O’Neal declined to comment on the Hortonworks situation, although considering that Hortonworks just announced its first software last week, it’s possible NetApp will still add a Hortonworks edition of its “open solution” when those products are ready for production.

Feature image courtesy of Flickr user miheco.

2 Responses to “NetApp does network-attached Hadoop”

  1. Hi Derrick,

    To echo Tomer’s comments, the NetApp Open Solution for Hadoop (NOSH) v1.0 only uses network-attached storage (a NAS filer, as per Cloudera best practices) for the metadata (NameNode HA). All NOSH-attached DataNodes are attached directly to their servers via SAS cabling, without any storage switch (or Ethernet for storage-layer traffic) whatsoever. This helps preserve the shared-nothing nature of HDFS DataNodes.

    NOSH customers (particularly large ones) will find improved data locality / reduced fragmentation of partitioned HDFS data over time due to elimination of non-deterministic replication between DataNodes during recovery of disk or server failures.

    We’d love to hear specific HDFS functional requirements around snapshots, mirroring / replication and performance desires from developers and practitioners. We hope to publish NOSH performance numbers soon, as we’re seeing great results in our labs so far, over and above the phenomenal improvement in query and job completion during degraded mode (recovery from disk failure).

    -Val Bercovici.
    NetApp Cloud & BigData Czar

  2. Tomer Shiran

    Good article. Three comments:
    1) Leveraging NetApp hardware for the individual node does not make HDFS any more reliable, and does not address any of the key issues that MapR has solved. There is no HA (NameNode is a single point of failure), there is no point-in-time recovery (snapshots), and there is no disaster recovery (cross-cluster mirroring). It also doesn’t address the other advantages of the MapR/EMC solution, such as random read/write, NFS access and higher MapReduce performance.
    2) “Separating the compute and storage layers of Hadoop” is specified (by Ed) as desirable. The truth is that Hadoop is specifically designed to leverage direct attached storage. While HDFS may have had issues with handling individual disk failures, other distributions (e.g., MapR) can deal with a disk failure with no issue.
    3) Cloudera Enterprise is not open source. (In fact, CDH is no longer open source either, with SCM Express, data warehouse connectors, and Hive ODBC driver all being closed source.)