
Summary:

Hadoop, thanks to the growing importance of Big Data analytics, is gaining traction inside the enterprise. What’s been missing for Big Data analytics has been a LAMP-like stack. Fortunately, a stack for Big Data aggregation, processing and analytics is on its way.

Many Fortune 500 and mid-size enterprises are funding Hadoop test/dev projects for Big Data analytics, but question how to integrate Hadoop into their standard enterprise architecture. For example, Joe Cunningham, head of technology strategy and innovation at credit card giant Visa, told the audience at last year’s Hadoop World that he would like to see Hadoop evolve from an alpha/beta environment into mainstream use for transaction analysis, but has concerns about integration and operations management.

What’s been missing for Big Data analytics has been a LAMP (Linux, Apache HTTP Server, MySQL and PHP) equivalent. Fortunately, there’s an emerging LAMP-like stack for Big Data aggregation, processing and analytics that includes:

  • Hadoop Distributed File System (HDFS) for storage
  • MapReduce for distributed processing of large data sets on compute clusters
  • HBase for fast read/write access to tabular data
  • Hive for SQL-like queries on large data sets as well as a columnar storage layout using RCFile
  • Flume for log file and streaming data collection, along with Sqoop for database imports
  • JDBC and ODBC drivers to allow tools written for relational databases to access data stored in Hive
  • Hue for user interfaces
  • Pig for dataflow and parallel computations
  • Oozie for workflow
  • Avro for serialization
  • ZooKeeper for coordination services for distributed applications
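To make the MapReduce entry above concrete, here is a minimal sketch of the programming model in plain Python: a word count with the map, shuffle and reduce phases simulated in a single process. This is only an illustration of the model; real Hadoop jobs implement Mapper and Reducer classes in Java and run distributed across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data analytics", "big data platforms"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["big"] == 2, counts["analytics"] == 1
```

The appeal of the model is that only `map_phase` and `reduce_phase` are user code; partitioning, shuffling and fault tolerance are the framework's job.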

While that’s still a lot of moving parts for an enterprise to install and manage, we’re almost to a point where there’s an end-to-end “hello world” for analytical data management. If you download Cloudera’s CDH3b2, you can import data with Flume, write it into HDFS, and then run queries using Cloudera’s Beeswax Hive user interface.
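To give a flavor of the query step in that "hello world," here is a sketch using Python's built-in sqlite3 module as a stand-in for a Hive warehouse. The table name and rows are invented for illustration; in Hive, near-identical HiveQL DDL and a GROUP BY aggregate like this would run over files stored in HDFS, typically from an interface such as Beeswax.

```python
import sqlite3

# Stand-in for a Hive table; in HiveQL the CREATE TABLE and the
# GROUP BY query below would look almost the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 1), ("home", 2), ("pricing", 1), ("home", 1)],
)

# The kind of aggregate an analyst might run from a Hive UI.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()
# rows == [("home", 3), ("pricing", 1)]
```

The point of Hive is precisely this familiarity: analysts who know SQL can query Big Data sets without writing MapReduce jobs by hand.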

With the benefit of this emerging analytical platform, data science is becoming more integral to businesses, and less a quirky, separate function. As an industry, we’ve come a long way since industry visionary Jim Gray was famously thrown out of the IBM Scientific Center in Los Angeles for failure to adhere to IBM’s dress code.

Adobe’s infrastructure services team has scaled HBase implementations to handle several billion records with access times under 50 milliseconds. Their “Hstack” integrates HDFS, HBase and ZooKeeper with the Puppet configuration management tool. Adobe can now automatically deploy a complete analytical data stack across a cluster.

Working with Hive, Facebook created a web-based tool, HiPal, that enables non-engineers to run queries on large data sets, view reports, and test hypotheses using familiar web browser interfaces.

For Hadoop to realize its potential for widespread enterprise adoption, it needs to be as easy to install and use as Lotus 1-2-3 or its successor Microsoft Excel. When Lotus introduced 1-2-3 in 1983, they chose the name to represent the tight integration of three capabilities: a spreadsheet, charting/graphing and simple database operations. As a high school student, I used it to manage the reseller database for a storage startup, Maynard Electronics. Even as a 15-year-old, I found Lotus 1-2-3 easy to use. More recently, with Microsoft Excel 2010 and SQL Server 2008 R2, I can click on Excel ribbon buttons to load and prepare PowerPivot data, create charts and graphs using out-of-the-box templates, and publish on SharePoint for collaboration with colleagues.

“The Fourth Paradigm” quotes Jim Gray as saying “We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.” As the Hadoop data stack becomes more LAMP-like, we get closer to realizing Jim’s vision and giving enterprises an end-to-end analytics platform to unlock the power of their Big Data with the ease of use of a Lotus 1-2-3 or Microsoft Excel spreadsheet.

Brett Sheppard is an executive director at Zettaforce.

Related Post from GigaOM PRO (Sub Req’d): The Incredible, Growing, Commercial Hadoop Market




  1. Nice, but I don’t expect an easy install (any time soon, at least) for a solution capable of handling terabytes of data in a few seconds.

    Also, “easy install” has never been a criterion for enterprise solutions. Considering the features that Hadoop provides, it’s worth the complexity to install it.

    If companies don’t have enough expertise to use it, they can stick to Lotus 1-2-3 or Microsoft Excel.

    That’s the open source world, either you adopt, join and contribute, or keep waiting forever.

  2. Hi!

    Even though I think that the Cloudera distribution is a lot of wholesome goodness, I have to argue with your list above a bit.

    It looks as though you simply read those names off of Cloudera’s web site. Many of those projects are relatively new, and could hardly be referred to as an emerging standard.

    Nice article though, even though it didn’t take a whole lot of research to put together.

    Colin

  3. Hi Colin,

    Thanks for commenting. The stack components listed here are not specific to any one distribution of Hadoop. Most if not all apply to the Apache Hadoop, Amazon Elastic MapReduce (EMR), Cloudera and Yahoo! distributions. What I personally find interesting about this stack is not that all of the pieces are fully developed yet — for example, Doug Cutting and his colleagues in the Apache Avro project are still developing support for Map/Reduce over Avro for data interchange — but that there’s the potential for something really exciting here for an emerging platform for Big Data analytics.

    Thanks again for commenting. Best regards,

    Brett

  4. It would be really helpful if you could update the list with proper links to their respective webpages. Still an interesting post.

    1. Hi Sumit,

      Glad you found the post interesting. There are URL links with documentation for each subproject on the Apache Hadoop page at http://hadoop.apache.org/

      Best regards, Brett

  5. I agree with Colin, most of these technologies seem like they were read off of Cloudera’s website.

    Listing HBase but not Cassandra or Riak? Avro and not Protobuffs/Thrift/BSON? Does anybody actually use Sqoop?

    1. Hi Evan,

      Thanks for commenting. Hadoop and its key subprojects — HDFS, HBase, MapReduce, Hive, Chukwa (which I forgot to put in the article list but it should be there too), Pig and ZooKeeper — are not specific to any one vendor. Documentation for each is at the Apache Hadoop page at http://hadoop.apache.org/. You can use them with Apache Hadoop, Amazon EMR, Cloudera and Yahoo! Hadoop distributions.

      The Amazon Dynamo derivatives — Cassandra, Riak and Voldemort — are indeed an interesting topic, but a separate one from the Hadoop subprojects and how as a group they are helping to create the foundation for a Hadoop data analytics platform. Of the three Dynamo derivatives, there are some very interesting deployments of Cassandra, including at Digg and Cloudkick. Basho is working with Riak but I’m less familiar with Riak commercial deployments. Outside of the Dynamo derivatives, there are other worthwhile NoSQL options too, like CouchDB for document-oriented databases.

      For serialization, there are indeed options other than Avro, but they may be less useful for working with Hadoop. I’ve seen BSON primarily with MongoDB, and Thrift and Google’s Protocol Buffers used with Java virtual machine (JVM) services, while Avro began as a Hadoop subproject before graduating to a top-level Apache project.

      Thinking for a moment beyond individual tools, what I personally find particularly interesting is how the overall Hadoop ecosystem is starting to come together to be a viable platform for Big Data analytics.

      Thanks again for the comments! Best regards, Brett

  6. Daniel Trebbien Monday, August 2, 2010

    Where can I find information on “Hue”?

    1. Hey Daniel,

      The code for HUE is at http://github.com/cloudera/hue, the mailing list is at https://groups.google.com/a/cloudera.com/group/hue-user, and a blog post introducing HUE is at http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-hue.

      Regards,
      Jeff

    2. Hi Daniel,

      Info about Hadoop User Interface (HUE), including a github URL link to download the open source code, is at http://www.cloudera.com/blog/2010/07/developing-applications-for-hue/. Karmasphere (www.karmasphere.com) and Datameer (www.datameer.com) have interesting options too for user interfaces.

      Best regards, Brett

  7. Interesting article – as a BI architect working within the narrow Gartner realms of refined and repackaged buyouts, I don’t get much choice in selecting or building tailored stacks.
    Having to fiddle with components that have been slammed together through a number of mergers and releases before they actually integrate as ‘the enterprise package’ as advertised has many of us watching the open-source space ….

    1. Hi Mike,

      Good perspective to add, thanks. One of the nice things about the growing Hadoop ecosystem is a growing set of interfaces with traditional BI software and tools. Export/import through JDBC and ODBC drivers can introduce latency, which could be a significant problem for operational BI applications but tends to be less of a major concern for Big Data analytics applications. Hadoop by itself doesn’t solve issues with data quality, siloed data with different definitions, etc., but you can run data transformations inside Hive. Talend has some interesting perspectives and tools for data integration and data quality within Hadoop — info is at http://www.talend.com/blog/2010/07/01/tackling-big-data-with-hadoop-support/. Thanks again for commenting!

      Best regards,

      Brett

  8. Brett:

    You should probably add R for open-source statistical analysis. With RHIPE (R and Hadoop Integrated Processing Environment), one can write Hadoop jobs in R.

    1. Hi Raj,

      Good addition to note, thanks!

      Best regards, Brett

  9. Agree. Spreadsheets provide an ideal interface for business users grappling with big data, whether it’s offline big data analytics such as Datameer Hadoop, or continuous realtime big data analytics such as Cloudscale or the Dremel work at Google. With Cloudscale, every aspect of realtime big data analytics can be handled entirely from within our free Excel plug-in – drag-and-drop building block app development, cloud deployment, interactive visualization. The massively parallel analytics heavy lifting is, of course, done in the cloud, or, from later this year, on an in-house Cloudscale architecture.


