

[Image: fighting elephants]

LexisNexis is releasing a set of open-source, data-processing tools that it says outperforms Hadoop and even handles workloads Hadoop presently can’t. The technology (and new business line) is called HPCC Systems, and was created 10 years ago within the LexisNexis Risk Solutions division that analyzes huge amounts of data for its customers in intelligence, financial services and other high-profile industries. There have been calls for a legitimate alternative to Hadoop, and this certainly looks like one.

According to Armando Escalante, CTO of LexisNexis Risk Solutions, the company decided to release HPCC now because it wanted to get the technology into the community before Hadoop became the de facto option for big data processing. Escalante told me during a phone call that he thinks of Hadoop as “a guy with a machete in front of a jungle — they made a trail,” but that he thinks HPCC is superior.

But in order to compete for mindshare and developers, he said, the company felt it had to open-source the technology. One big thing Hadoop has going for it is its open-source model, Escalante explained, which attracts a lot of developers and a lot of innovation. If his company wanted HPCC to “remain relevant” and keep improving through new use cases and ideas from a new community, the time for release was now and open source had to be the model.

Hadoop, of course, is the Apache Software Foundation project created several years ago by then-Yahoo employee Doug Cutting. It has become a critical tool for web companies — including Yahoo and Facebook — to process their ever-growing volumes of unstructured data, and is fast making its way into organizations of all types and sizes. Hadoop has spawned a number of commercial distributions and products, too, including offerings from Cloudera, EMC and IBM.

How HPCC works

Hadoop relies on two core components to store and process huge amounts of data: the Hadoop Distributed File System and Hadoop MapReduce. However, as Cloudant CEO Mike Miller explained in a post over the weekend, MapReduce is still a relatively complex programming model for writing parallel-processing workflows. HPCC seeks to remedy this with its Enterprise Control Language.

Escalante says ECL is a declarative, data-centric language that abstracts away much of the work required in MapReduce. For certain tasks that take a thousand lines of code in MapReduce, he said, ECL requires only 99 lines. Furthermore, he explained, ECL doesn’t care how many nodes are in the cluster, because the system automatically distributes data across however many nodes are present. Technically, though, HPCC could run on just a single virtual machine. And, says Escalante, HPCC is written in C++ — like the original Google MapReduce on which Hadoop MapReduce is based — which he says makes it inherently faster than the Java-based Hadoop version.
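
To make that declarative style concrete, here is a minimal ECL sketch. The logical file name and record layout are hypothetical, invented purely for illustration; the point is that the programmer describes what to compute and the platform decides how to parallelize it:

```ecl
// Hypothetical record layout and logical file name, for illustration only.
PersonRec := RECORD
    STRING25 FirstName;
    STRING25 LastName;
    STRING15 City;
END;

Persons := DATASET('~tutorial::persons', PersonRec, THOR);

// Declarative aggregation: count people per city. The platform
// distributes the work across however many nodes the cluster has.
CityCounts := TABLE(Persons, {City, UNSIGNED4 Cnt := COUNT(GROUP)}, City);

OUTPUT(SORT(CityCounts, -Cnt));
```

An equivalent hand-written Java MapReduce job would need mapper and reducer classes, key/value plumbing and job configuration before it ever got to the actual logic.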

HPCC offers two options for processing and serving data: the Thor Data Refinery Cluster and the Roxie Rapid Data Delivery Cluster. Escalante said Thor — so named for its hammer-like approach to solving the problem — crunches, analyzes and indexes huge amounts of data a la Hadoop. Roxie, on the other hand, is more like a traditional relational database or data warehouse that can even serve transactions to a web front end.
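
As a rough sketch of how the two clusters divide the labor (again with hypothetical names, and glossing over the fact that the batch job and the published query would be separate workunits in practice): Thor builds an index in batch, and a query published to Roxie serves fast parameterized lookups against it:

```ecl
// On Thor: batch-build a payload index over the Persons dataset
// from the previous example, keyed by last name.
PersonIdx := INDEX(Persons, {LastName}, {FirstName, City},
                   '~tutorial::persons_by_lastname');
BUILD(PersonIdx);

// On Roxie: a published query reads a STORED parameter supplied by
// the caller (e.g. a web front end) and does an indexed lookup.
STRING25 searchName := '' : STORED('searchName');
OUTPUT(PersonIdx(LastName = searchName));
```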

We didn’t go into detail on HPCC’s storage component, but Escalante noted that it does utilize a distributed file system, and that it can support a variety of off-node storage architectures as well as local solid-state drives.

He added that in order to ensure LexisNexis wasn’t blinded by “eating its own dogfood,” his team hired a Hadoop expert to kick the tires on HPCC. The consultant was impressed, Escalante said, but did note some shortcomings that the team addressed as it readied the technology for release. It also built a converter for migrating Hadoop applications written in the Pig language to ECL.

Can HPCC Systems actually compete?

The million-dollar question is whether HPCC Systems can actually attract an ecosystem of contributors and users that will help it rise above the status of big data also-ran. Escalante thinks it can, in large part because HPCC already has been proven in production dealing with LexisNexis Risk Solutions’ 35,000 data sources, 5,000 transactions per second and large, paying customers. He added that the company also will provide enterprise licenses and proprietary applications in addition to the open-source code. Plus, it already has potential customers lined up.

It’s often said that competition means validation. Hadoop has moved from a niche set of tools to the core of a potentially huge business that’s growing every day, and even Microsoft has a horse in this race with its Dryad set of big data tools. Hadoop has already proven itself, but the companies and organizations relying on it for their big data strategies can’t rest on their laurels.

Image courtesy of Flickr user NileGuide.com.


  1. Robbie Williamson Wednesday, June 15, 2011

    I’m all for competition, as I think it pushes everyone to improve, but I don’t know how compelling HPCC’s open-core approach will be.

    1. I agree it will get interesting — if HPCC gets traction — although I don’t know that the open-core concern will be too great. It depends on how much the Enterprise version’s differences come down to code rather than services, extra tools, etc. Even Cloudera has an Enterprise edition.

    2. I think you’ll find the differences between the enterprise and community versions here. http://www.hpccsystems.com/products-and-services/products/ee-ce-comparison

  2. This article is too early; this hasn’t been released yet!
    http://hpccsystems.com/download/community-edition
    Binary and source code downloads are coming soon

    1. From the press release: “HPCC Systems will initially release a virtual machine for the community to test, in addition to documentation and training. Full binaries will be released in several weeks and the source code will be released in a few more weeks after the binaries.”

      VM is available at http://hpccsystems.com/download/hpcc-vm-image.

      1. Thanks for the answer; I would much rather have the binaries than a VM, so I guess I’ll have to wait.

  3. I would like to see the ECL language HPCC exposes as an interface language for the processing platform. Distributed computing doesn’t necessarily mean only data processing, and if ECL is not flexible enough (read: Turing complete) it could be quite a limitation.

    1. George L’Heureux Davor Friday, June 17, 2011

      There’s a Wikipedia article available here: http://en.wikipedia.org/wiki/ECL_programming_language

    2. George L’Heureux Davor Friday, June 17, 2011

      Sorry, wrong link. This is the one you want to check out. http://en.wikipedia.org/wiki/ECL,_data-centric_programming_language_for_Big_Data

    3. Davor, there is an ECL Programmers Guide available at http://www.lexisnexis.com/risk/about/guides/program-guide.html .

  4. I am looking forward to trying this out. I’m sure it will beat out Hadoop on in-memory queries, since it’s written in C++ and isn’t married to the JVM’s memory manager and bloated object sizes.

    1. You have no idea what’s going on with the JVM in its current state. It can actually beat C++ at creating new objects, with as few as 10 CPU instructions per allocation.

      5,000 transactions per second is not a lot, but it surely depends on context.

      HPCC Systems is not a Hadoop killer by any means if it can’t be easily extended and modified the way Hadoop can. That’s what a truly good open-source system is about.
      But anyway, any competitive solution is always welcome in the open-source world.

  5. Vladimir Rodionov Wednesday, June 15, 2011

    Roxie is the most interesting part of HPCC … but there is literally no info on it at all. Does it use the same ECL to retrieve the data? We could try Hadoop for data crunching and combine it with Roxie for analytical investigations (Hive is too slow for interactive processing).

    1. Yes, Roxie uses the same ECL language.

  6. Mihaela Mihaljevic Jakic Wednesday, June 15, 2011

    “HPCC is written in C++”

    I think that this is a killer feature over Hadoop. A major one.

  7. Another open source competitor worth keeping an eye on is Nokia’s Disco, which pairs an Erlang core (ideal for dealing with huge numbers of parallel processes) with map and reduce functions that the user writes in Python. See: http://discoproject.org/

  8. Derrick,

    You point out a critical issue facing the analytics market – complexity. ECL and other technologies will move the market forward and allow analytics to become more pervasive across the enterprise. Simplicity is key for widespread adoption, especially by business users.

    Matt Benati
    Senior Director, Analytics
    IBM Netezza

  9. Looks very cool. I am looking at using this to replace Mahout, as the Mahout devs seem unable to get a flipping stable release out there.
