
Summary:

Concerned about proprietary and expensive forks of Hadoop, T-Systems’ Juergen Urbanski explains how to tell if you are buying an open version of Hadoop or something you might later regret.

Hadoop is fast becoming the preferred way to store and process big data. By T-Systems’ estimates, in five years 80 percent of all new data will first land in Hadoop’s distributed file system (HDFS) or in alternative object storage architectures.

Yet amid the excitement around this open source framework, enterprise users risk overlooking that not all Hadoop flavors are created equal. Choosing one implementation over another can mean veering off the path of genuine open source software and heading down the dead-end street of expensive vendor lock-in and stunted innovation.

A little history lesson

The enterprise tech world has been there before. Remember the Unix vs. Linux schism? Unix began at Bell Labs around 1970, with UC Berkeley’s BSD variant following later in the decade. It was acclaimed for its performance, stability and scalability, and it was cutting-edge at the time in its multi-user and multitasking capabilities, its support for IP networking, and its tooling and graphical interfaces.

Unix also was a cash cow that large software vendors desperately wanted to milk. They developed powerful yet proprietary versions of Unix during the 1980s. The list of derivatives is long: HP-UX from Hewlett-Packard, DG/UX from Data General, AIX from IBM, IRIX from Silicon Graphics, as well as Sun’s Solaris. The consequence was a fragmented Unix landscape that held users captive once they’d settled on a flavor. Once switching becomes difficult, painful and expensive, it stunts innovation.

Linux followed a different path. This open source operating system has thrived ever since it was born in the early 1990s, thanks to a global community of developers. It quickly caught up to its older, more established rival in terms of performance and feature set because it was not trapped in proprietary silos. And it handily beat Unix in terms of capital expenses and operating costs. Since Linux runs on off-the-shelf hardware, it has important similarities to today’s Hadoop world of inexpensive commodity hardware.

Hadoop as the OS for big data

Fast forward to 2013, and you’ll see quite a few lessons that apply to Hadoop and enterprise customers who are thinking about implementing it to meet their big data needs.

Hadoop is like the emerging “operating system” for big data: it lets an organization use distributed hardware resources to store and compute on large data sets. The world of big data and its frameworks and tools changes fast, and, just like Linux, Hadoop thrives because of a community of hundreds of developers who put their time, resources and passion into continually making it better.
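
To make the “operating system” analogy a little more concrete, here is a minimal sketch of a Hadoop job, essentially the canonical Apache word-count example, written against only the public org.apache.hadoop.mapreduce API (a Hadoop 2.x-era client classpath is assumed; class and path names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: runs in parallel across the cluster, one task per
      // input split, and emits (word, 1) for every token it sees.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts for each word across all mappers.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, this would run with something like hadoop jar wordcount.jar WordCount /input /output. The point is that nothing here names a particular distribution: every class comes from the public Apache APIs.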

Systems integrators, service providers and enterprise customers are well advised to pick Hadoop distributions that are truly open source, with room to grow as the technology matures. One example is Palo Alto-based startup Hortonworks, host of the sixth annual Hadoop Summit being held this week in San Jose. Another, though with some functionality for now held back from true open source licensing, is Cloudera.

At the other end of the spectrum are vendors that only pay lip service to Hadoop. They embrace and extend it with their own proprietary twists, and in the process they create fragmentation. While I cannot name any particular vendor because of my professional obligations, you can imagine this category including some of the large-cap IT vendors now muscling into big data.

As the big data technologist for Deutsche Telekom, possibly Silicon Valley’s single largest European customer of IT products and services, I’ve seen enough Hadoop offerings and implementations to suggest a quick, two-step reality check.

How to test your Hadoop distro’s open source street cred

Step one: Can you open the hood and see the engine? Free software doesn’t equal open source. What is free today can become a costly piece of software once the vendor realizes it is sitting on a gold mine and raises prices, knowing its users are locked in. Open source, on the other hand, means the source code is open for everybody to see and is distributed under one of several recognized open source licenses (see, for instance, http://opensource.org/licenses). Open source means you can look under the hood, and tinker with, enhance and maintain your big data “operating system”.

Step two: The second telltale sign that a vendor really means it when it proclaims its love for Hadoop is this: does it have skin in the game of give and take? The Hadoop community breaks into three groups: reviewers, contributors and committers. The committers are the seasoned members who set the roadmap, coordinate development and make sure all the pieces eventually click and run. If you want to know whether a vendor really means it, check how many of its employees are Hadoop committers.

If you don’t see a significant number of committers in a company’s ranks, you can be fairly certain that the vendor is just going through the motions with Hadoop. It may be ticking off a list of must-have features and tools to lure customers to its version of Hadoop, but it is most likely pursuing a strategy of forking a great idea for its own gain.

Do your organization and the value hidden in your datasets a favor and don’t become a Hadoop hostage. You risk paying ransom for years or even decades to come while missing out on waves of innovation that will make the entire economy relentlessly data-driven.

Picking the open source or the proprietary path for Hadoop is not just a decision every IT department has to make for itself. Taken together, these decisions will determine whether Hadoop goes the way of Unix or of Linux, whether it becomes a lock-in legacy or a break-out success. I’m personally rooting for the latter, because Hadoop is a powerful framework with lots of promise.

Juergen Urbanski is VP of Big Data Architectures & Technologies at T-Systems, the enterprise arm of Deutsche Telekom. He writes here in a personal capacity.


  1. Hm-m, Juergen. You have forgotten to mention rule #1: do not use proprietary APIs or Hadoop extensions; use only the public Hadoop API and you are fine. You will be able to switch to another Hadoop distribution later on once you decide it’s necessary.

    Stop spreading “It’s not open source” phobia. 99% of the IT/software market belongs to private, proprietary and closed-source products.

  2. ciberelmaster Wednesday, June 26, 2013

    Reblogged this on Que Hay Dentro De… and commented:
    New Reblog

  3. Juergen Urbanski Friday, June 28, 2013

    Hi Vlad, I agree with your comment about the importance of using the public APIs, not forked ones. Michael Segel, who runs the Hardcore Hadoop Group on LinkedIn, remarked that the RDBMS wars of the ’90s might be an even better comparison, because you can equate ANSI SQL to the Apache Hadoop APIs. (A short sketch of what coding to the public API looks like appears after the comments below.)
    From a service provider perspective, I would like to avoid vendor lock-in at the level of the physical infrastructure (x86) and the data management software (Hadoop). Further up in the stack, at the application level, we are much more willing to embrace the value-add of proprietary solutions.

  4. Hi Juergen,

    I will propose a different analogy. Competitive advantage in this game is derived from barriers to entry, and it lies with whoever owns the operating system. I am not referring to Linux: the operating system today is the cloud. Cloud services built on Hadoop or its variants and the cloud itself form a virtuous circle.

    Operating System == Cloud
    Data Store == HDFS / Hadoop or whatever else runs on top of HDFS
    Applications == Cloud based services

    This tilts competitive advantage towards cloud providers, which will become very apparent in the coming years.

  5. Linux beat UNIX “in terms of capital expenses and operating costs”. That’s right, but not in innovation, much of which Linux borrowed from UNIX.
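
To make the portability point from comments 1 and 3 concrete, here is a minimal sketch that prints the first lines of a file in HDFS using only the public org.apache.hadoop.fs API. The class name HdfsHead is illustrative; a Hadoop client classpath and a configured core-site.xml are assumed:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHead {
      public static void main(String[] args) throws Exception {
        // The cluster location comes from core-site.xml (fs.defaultFS),
        // not from any vendor-specific class, so the same jar runs
        // against any distribution that honors the public API.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0])), "UTF-8"))) {
          String line;
          for (int i = 0; i < 10 && (line = reader.readLine()) != null; i++) {
            System.out.println(line);
          }
        }
      }
    }

Swap the cluster underneath (Hortonworks, Cloudera, or a plain Apache build) and code like this keeps working; that is exactly the freedom to switch that comment 1 describes.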
