Summary:

The Apache Mahout project will now support Apache Spark and another data engine called H2O as it tries to retain its status as the go-to set of machine learning libraries for Hadoop.

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xdata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of statistical computations, including deep learning models, on data stored in the Hadoop Distributed File System.

“[H2O] looks like a really good technology layer to drive a lot of what Mahout’s been missing and remove the artificial constraints that have been in Mahout’s way,” Ted Dunning, project management committee member for Apache Mahout and the chief application architect at Hadoop software vendor MapR, told me. “A combination of H2O and Spark could really be something,” he added.

SriSatish Ambati, the founder and CEO of 0xdata, noted that the data science community isn’t married to any one computational framework, as long as it gets the job done. It’s the higher-level stuff, including how people program models, that really matters. H2O natively supports the R programming language, for example, which is rather popular and would be a new capability for Mahout, Dunning said.

One could argue that the Mahout community had to embrace Spark, at least, if it wanted to remain relevant. Already, Cloudera is working on its Oryx machine learning framework, which was designed to overcome Mahout’s shortcomings and will be ported to Spark at some point. The Spark community itself is also working on a set of machine learning libraries called MLlib.

  1. I’d just like to add that the Mahout community has discussed a move away from MapReduce for a very long time. But until recently, no stable new system under an attractive license was available (there were some discussions about integrating Apache Giraph, but we felt that the programming model was too constrained for general ML). With the recent graduation of Apache Spark to a top-level project, this situation has changed.
