Data-integration specialist Syncsort is releasing two new Hadoop tools that it says will give Hadoop users a better, faster experience than they can achieve using Apache Hadoop alone. Unlike some other recent Hadoop announcements, however, Syncsort isn’t looking to replace Apache Hadoop or Cloudera’s commercial version of it, but rather to improve and complement the big data analytics framework.
The first tool is an open source “sort” plugin, which Syncsort will be contributing as code to Apache Hadoop. Sorting is the process of ordering records within large data sets by specific parameters, explained Syncsort’s Keith Kohl, and the native sort capability in Apache Hadoop is lacking in both functionality and performance. The plugin lets users plug the sort technology of their choice into the Hadoop environment. Along with the plugin, Syncsort is releasing its own improved sort tool, which it calls Hadoop Acceleration.
Syncsort is also chipping in with a new version of its DMExpress product that specifically targets Hadoop environments. That product connects to the Hadoop Distributed File System (HDFS) and includes the new sort functionality, but, Kohl said, stands out because it provides a better way to write MapReduce jobs. With DMExpress Hadoop Edition, customers can create jobs using Syncsort’s graphical user interface while letting the product tackle the tough business of tuning the process and doing the actual map and reduce coding. Kohl points to ComScore to prove that DMExpress plus Hadoop is a powerful combination. ComScore, which already uses DMExpress, achieved double the performance during benchmark testing of the Hadoop edition.
Kohl said he doesn’t see the new products as competition for existing Hadoop distributions. Instead, he said, Syncsort customers are already embracing Hadoop, and the company is just aiming to give them an easier experience because many lack the requisite skills for advanced MapReduce programming. And by contributing the sort plugin to Apache Hadoop, Syncsort is actually hoping to make that distribution stronger. One company that might feel the pressure from Syncsort, however, is fellow data-integration veteran Pervasive Software, which has been touting its own Hadoop-speeding DataRush technology lately.
Syncsort’s decision to contribute to Apache Hadoop is exactly what I called for late last month in response to yet more public discussion about the complexities of Hadoop and of MapReduce in particular. Cloudera, Yahoo, Facebook and others have been driving innovation inside Apache, but even as those improvements get folded into the official code, there will always be specific pain points that others feel the need to address. Syncsort is one of many vendors now selling proprietary products to simplify the process of writing MapReduce jobs, but it’s fairly unusual in actually contributing its expertise back to the core project to try to fundamentally improve Hadoop itself.
Image courtesy of Flickr user Eric__I_E.