
Meet the Big Data Equivalent of the LAMP Stack


Many Fortune 500 and mid-size enterprises are funding Hadoop test/dev projects for Big Data analytics, but question how to integrate Hadoop into their standard enterprise architecture. For example, Joe Cunningham, head of technology strategy and innovation at credit card giant Visa, told the audience at last year’s Hadoop World that he would like to see Hadoop evolve from an alpha/beta environment into mainstream use for transaction analysis, but has concerns about integration and operations management.

What’s been missing for Big Data analytics has been a LAMP (Linux, Apache HTTP Server, MySQL and PHP) equivalent. Fortunately, there’s an emerging LAMP-like stack for Big Data aggregation, processing and analytics that includes:

  • Hadoop Distributed File System (HDFS) for storage
  • MapReduce for distributed processing of large data sets on compute clusters
  • HBase for fast read/write access to tabular data
  • Hive for SQL-like queries on large data sets as well as a columnar storage layout using RCFile
  • Flume for log file and streaming data collection, along with Sqoop for database imports
  • JDBC and ODBC drivers to allow tools written for relational databases to access data stored in Hive
  • Hue for user interfaces
  • Pig for dataflow and parallel computations
  • Oozie for workflow
  • Avro for serialization
  • ZooKeeper for coordination services for distributed applications
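To give a flavor of the distributed-processing layer, here is a minimal sketch of the map/reduce programming model in plain Python. It runs in a single process with illustrative function names of my own choosing; Hadoop's MapReduce applies these same two phases in parallel across HDFS blocks on a compute cluster.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts per key.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big analytics", "big stack"]
print(reduce_phase(map_phase(lines)))
# -> {'big': 3, 'data': 1, 'analytics': 1, 'stack': 1}
```

The framework's value is not in these few lines of logic but in running them fault-tolerantly across thousands of nodes.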

While that’s still a lot of moving parts for an enterprise to install and manage, we’re almost to a point where there’s an end-to-end “hello world” for analytical data management. If you download Cloudera’s CDH3b2, you can import data with Flume, write it into HDFS, and then run queries using Cloudera’s Beeswax Hive user interface.
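For readers who want a feel for that flow without standing up a cluster, here is a rough local stand-in sketched in Python: a list plays the part of Flume's event stream, a temporary file plays the part of HDFS, and a group-by loop plays the part of the Hive query. Every name here is illustrative, not Cloudera's actual APIs.

```python
import os
import tempfile
from collections import Counter

# Stage 1 (Flume stand-in): collect raw web log events.
events = ["GET /index.html 200", "GET /missing 404", "GET /index.html 200"]

# Stage 2 (HDFS stand-in): persist the events to a file in a local directory.
datadir = tempfile.mkdtemp()
path = os.path.join(datadir, "weblog.txt")
with open(path, "w") as f:
    f.write("\n".join(events))

# Stage 3 (Hive/Beeswax stand-in): the query
#   SELECT status, COUNT(*) FROM weblog GROUP BY status;
# expressed as a local group-by over the stored file.
with open(path) as f:
    status_counts = Counter(line.split()[-1] for line in f)
print(dict(status_counts))
```

Running it prints a count of 2 for status 200 and 1 for status 404; the real stack does the same aggregation over terabytes.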

With the benefit of this emerging analytical platform, data science is becoming more integral to businesses, and less a quirky, separate function. As an industry, we’ve come a long way since industry visionary Jim Gray was famously thrown out of the IBM Scientific Center in Los Angeles for failing to adhere to IBM’s dress code.

Adobe’s infrastructure services team has scaled HBase implementations to handle several billion records with access times under 50 milliseconds. Their “Hstack” integrates HDFS, HBase and ZooKeeper with the Puppet configuration-management tool. Adobe can now automatically deploy a complete analytical data stack across a cluster.

Working with Hive, Facebook created a web-based tool, HiPal, that enables non-engineers to run queries on large data sets, view reports, and test hypotheses using familiar web browser interfaces.

For Hadoop to realize its potential for widespread enterprise adoption, it needs to be as easy to install and use as Lotus 1-2-3 or its successor, Microsoft Excel. When Lotus introduced 1-2-3 in 1983, it chose the name to represent the tight integration of three capabilities: a spreadsheet, charting/graphing and simple database operations. As a high school student, I used it to manage the reseller database for a storage startup, Maynard Electronics. Even as a 15-year-old, I found Lotus 1-2-3 easy to use. More recently, with Microsoft Excel 2010 and SQL Server 2008 R2, I can click on Excel ribbon buttons to load and prepare PowerPivot data, create charts and graphs using out-of-the-box templates, and publish on SharePoint for collaboration with colleagues.

“The Fourth Paradigm” quotes Jim Gray as saying “We have to do better producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.” As the Hadoop data stack becomes more LAMP-like, we get closer to realizing Jim’s vision and giving enterprises an end-to-end analytics platform to unlock the power of their Big Data with the ease of use of a Lotus 1-2-3 or Microsoft Excel spreadsheet.

Brett Sheppard is an executive director at Zettaforce.

Related Post from GigaOM PRO (Sub Req’d): The Incredible, Growing, Commercial Hadoop Market

16 Responses to “Meet the Big Data Equivalent of the LAMP Stack”

  1. Richard Daley

    If Hadoop is the best thing to happen to data management since the spread sheet, then making Hadoop as easy as Lotus 1-2-3 or Microsoft Excel surely will accelerate Hadoop adoption. The great thing about the LAMP-like stack for Big Data aggregation, processing and analytics described here is that there are so many opportunities for organizations beyond the Hadoop project to make contributions.

    Here at Pentaho we are well along with our beta program to deliver the industry’s first complete end-to-end data integration and business intelligence platform to support Hadoop. Our first deliverable in this initiative is the enhancement of Pentaho Data Integration (PDI) to be the visual design environment for ETL processes that include the manipulation of Hadoop files and the execution of Hadoop tasks. This enables the design and execution of ETL processes that involve both Hadoop and non-Hadoop tasks. This deliverable also includes an embedded ETL engine for Hadoop. By the end of August, Hadoop users will be able to transform, manipulate, and aggregate files and data using the full functionality of a robust graphical designer and powerful ETL engine. The next set of deliverables, to follow soon after, will enable reporting, dashboards and analysis directly against data stored in Hadoop.

    As the open source Business Intelligence leader, Pentaho is doing this at a cost that will make Hadoop not only more user friendly, but also affordable for all.

  2. Agree. Spreadsheets provide an ideal interface for business users grappling with big data, whether it’s offline big data analytics such as Datameer Hadoop, or continuous realtime big data analytics such as Cloudscale or the Dremel work at Google. With Cloudscale, every aspect of realtime big data analytics can be handled entirely from within our free Excel plug-in – drag-and-drop building block app development, cloud deployment, interactive visualization. The massively parallel analytics heavy lifting is, of course, done in the cloud or, starting later this year, on an in-house Cloudscale architecture.

  3. Interesting article – as a BI architect from the narrow Gartner realms of refined and repackaged buyouts, I don’t get much choice in selecting or building tailored stacks.
    Having to fiddle with components that have been slammed together through a number of mergers and releases before they actually integrate as ‘the enterprise package’ as advertised has many of us watching the open-source space ….

    • Hi Mike,

      Good perspective to add, thanks. One of the nice things about the growing Hadoop ecosystem is a growing set of interfaces with traditional BI software and tools. Export/import through JDBC and ODBC drivers can introduce latency, which could be a significant problem for operational BI applications but tends to be less of a concern for Big Data analytics applications. Hadoop by itself doesn’t solve issues with data quality, siloed data with different definitions, etc., but you can run data transformations inside Hive. Talend has some interesting perspectives and tools for data integration and data quality within Hadoop. Thanks again for commenting!

      Best regards,


  4. I agree with Colin; most of these technologies seem like they were read off of Cloudera’s website.

    Listing HBase but not Cassandra or Riak? Avro and not Protobuf/Thrift/BSON? Does anybody actually use Sqoop?

    • Hi Evan,

      Thanks for commenting. Hadoop and its key subprojects — HDFS, HBase, MapReduce, Hive, Chukwa (which I forgot to put in the article list, but it should be there too), Pig and ZooKeeper — are not specific to any one vendor. Documentation for each is on the Apache Hadoop project site. You can use them with the Apache Hadoop, Amazon EMR, Cloudera and Yahoo! Hadoop distributions.

      The Amazon Dynamo derivatives — Cassandra, Riak and Voldemort — are indeed an interesting topic, but a separate one from the Hadoop subprojects and how, as a group, they are helping to create the foundation for a Hadoop data analytics platform. Of the three Dynamo derivatives, there are some very interesting deployments of Cassandra, including at Digg and Cloudkick. Basho is working with Riak, but I’m less familiar with Riak commercial deployments. Outside of the Dynamo derivatives, there are other worthwhile NoSQL options too, like CouchDB for document-oriented databases.

      For serialization, there are indeed options other than Avro, but they may be less useful for working with Hadoop. I’ve seen BSON primarily with MongoDB, and Thrift and Google’s Protocol Buffers with Java virtual machines (JVMs), while Avro began as a Hadoop subproject before graduating to a top-level Apache project.

      Thinking for a moment beyond individual tools, what I personally find particularly interesting is how the overall Hadoop ecosystem is starting to come together to be a viable platform for Big Data analytics.

      Thanks again for the comments! Best regards, Brett

  5. Hi Colin,

    Thanks for commenting. The stack components listed here are not specific to any one distribution of Hadoop. Most if not all apply to the Apache Hadoop, Amazon Elastic MapReduce (EMR), Cloudera and Yahoo! distributions. What I personally find interesting about this stack is not that all of the pieces are fully developed yet — for example, Doug Cutting and his colleagues in the Apache Avro project are still developing support for Map/Reduce over Avro for data interchange — but that there’s the potential for something really exciting here for an emerging platform for Big Data analytics.

    Thanks again for commenting. Best regards,


  6. Hi!

    Even though I think that the Cloudera distribution is a lot of wholesome goodness, I have to argue with your list above a bit.

    It looks as though you simply read those names off of Cloudera’s web site. Many of those projects are relatively new, and could hardly be referred to as an emerging standard.

    Nice article though, even though it didn’t take a whole lot of research to put together.


  7. nice, but I don’t expect an easy install (at least not anytime soon) for a solution capable of handling terabytes of data in a few seconds.

    Also, “easy install” has never been a criterion for enterprise solutions. Considering the features that Hadoop provides, it’s worth the complexity to install it.

    If companies don’t have enough expertise to use it, they can stick to Lotus 1-2-3 or Microsoft Excel.

    That’s the open source world: either you adopt, join and contribute, or keep waiting forever.