Hadoop doesn’t have to be so hard: just ask Etsy, Airbnb and the Climate Corporation. All three, it turns out, are using the Cascading framework atop Amazon Web Services’ Elastic MapReduce service to make creating and running big data jobs simpler than is possible using Hadoop alone.
Cascading is an open source Java framework that acts as an intermediary between users and Hadoop. Users define data workflows through Cascading’s Java-compatible APIs (rather than writing Hadoop MapReduce jobs directly), and the framework translates those workflows into the MapReduce jobs Hadoop actually runs. Cascading is backed by a commercial entity called Concurrent (see disclosure), which is headed up by creator Chris Wensel, and is the foundation of several variations including Cascalog (a Clojure-based query language for Hadoop) and Scalding (Twitter’s Scala API for Hadoop).
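To give a sense of what that abstraction looks like, here is a minimal word-count flow sketched against the Cascading 2.x API (the style used in Concurrent’s own tutorials). The class name, input and output paths are placeholders; the sketch assumes the Cascading and Hadoop jars are on the classpath and a Hadoop environment is available, so treat it as an illustration rather than a drop-in program.

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    String inPath = args[0];   // e.g. an HDFS or S3 path
    String outPath = args[1];

    // Taps bind the flow to concrete storage locations.
    Tap docTap = new Hfs(new TextLine(new Fields("line")), inPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // Split each line into words, then group and count them.
    Pipe docPipe = new Each("wordcount", new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    Pipe wcPipe = new GroupBy(docPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // Cascading plans this flow into the necessary MapReduce job(s).
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, WordCount.class);
    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}
```

Note that nothing here is a raw mapper or reducer: the user describes taps, pipes and operations, and the planner decides how many MapReduce stages are needed.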
Elastic MapReduce is AWS’s on-demand Hadoop service: rather than maintaining a collection of physical servers, users spin up cloud computing resources to run Hadoop workloads. Writing raw MapReduce jobs is both difficult and functionally limiting, and managing a Hadoop cluster is notoriously painful, so it’s no surprise that Cascading and Elastic MapReduce are an appealing combination.
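For a sense of the on-demand model, a packaged Cascading job can be submitted to Elastic MapReduce as a custom JAR step. The sketch below uses the modern AWS CLI (at the time this piece was written, Amazon shipped a separate Ruby `elastic-mapreduce` client); the bucket names, JAR path, release label and instance sizing are all placeholder assumptions, and running it requires an AWS account and credentials.

```shell
# Launch a transient cluster, run one Cascading JAR step, then shut down.
aws emr create-cluster \
  --name "cascading-wordcount" \
  --release-label emr-6.10.0 \
  --applications Name=Hadoop \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri s3://my-bucket/emr-logs/ \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name="wordcount",\
Jar=s3://my-bucket/jars/wordcount.jar,\
Args=[s3://my-bucket/input/,s3://my-bucket/output/]
```

The `--auto-terminate` flag is what makes the cluster truly on-demand: capacity exists only for the lifetime of the job, which is the operational simplification the companies above are buying.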
They’re hardly the only options for simplifying the Hadoop process, though. Infochimps, for example, offers a service that features tools for automating configuration and creating dataflows using Ruby. Mortar Data has created a managed service that runs atop AWS and lets users process data using Python scripts. And, of course, there are numerous Elastic MapReduce competitors on the market, including offerings from Microsoft, IBM and SunGard.
Feature image courtesy of Shutterstock user JTP.
Disclosure: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.