Infochimps, the Austin, Texas-based startup that transitioned in February from a data marketplace into a big data platform, has stepped up its abilities to handle streaming data. On Tuesday, it unveiled the Infochimps Platform 1.1, which improves the platform’s real-time analytics engine and turns its Wukong command line interface into a tool for writing scripts that can process streaming data.
The company has described its cloud platform, which can run on the Rackspace Cloud, as Heroku for Hadoop, although that characterization is becoming antiquated. While Hadoop is certainly an important part of the Infochimps stack, it’s actually not the focal point. “Usually, people come to us because they have a big data problem and heard they should look at Hadoop,” Infochimps CEO Joe Kelly told me, but they end up accomplishing a lot before they ever turn to Hadoop.
What often happens, Infochimps Chief Strategy Officer Dhruv Bansal said, is customers use the platform to build applications that can ingest, process and analyze data, only turning to Hadoop when they get to the point where they actually need to batch analyze large volumes of data. It’s this experience, he said, that led to the focus on the real-time features in the new release.
The new streaming analytics engine, called Data Delivery Service, is based on Apache Flume and lets Infochimps users process data as it flows into their systems. Using Wukong, a Ruby-based command line interface, developers can write big data applications that take advantage of Data Delivery Service, or Hadoop, using a simple grammar that doesn’t involve learning MapReduce or how to work with Flume.
The platform does, however, support other high-level Hadoop languages such as Hive and Pig. “Wukong is one way to interact,” Bansal said. “It’s not the only way.”
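To make the streaming-versus-batch distinction concrete, here is a minimal sketch in plain Python. It is not Wukong, Flume or any Infochimps API; the function names `running_word_counts` and `batch_word_counts` are hypothetical and exist only for illustration. The first function emits an updated result for every record as it arrives, the way a streaming engine would, while the second waits until it has seen the entire data set, the way a Hadoop-style batch job would.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_word_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Streaming style: yield (word, count_so_far) as each word arrives.

    Hypothetical helper for illustration; not part of any real platform API.
    """
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            # A result is available immediately, before the input ends.
            yield word, counts[word]


def batch_word_counts(lines: Iterable[str]) -> Counter:
    """Batch style: return totals only after all input has been read."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    sample = ["big data", "streaming data"]
    for word, count in running_word_counts(sample):
        print(word, count)
    print(batch_word_counts(sample))
```

Because the streaming version is a generator, it can be fed from a socket, a log tail or any other open-ended source without holding the full data set in memory, which is roughly the appeal of processing data as it flows in.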
With the new version of the platform, developers also have a feature called Deploy Pack, which lets them write and test code locally and then easily push it to the cloud production environment with a single command. Thanks to Ironfan, Infochimps’ infrastructure automation tool, databases, Hadoop clusters and whatever else an application needs will launch with minimal developer effort.
Infochimps can run on users’ cloud infrastructure of choice, and has partnered with Rackspace to run on its OpenStack-based cloud computing platform.
However, the Heroku for Hadoop tag is also a little premature because the Infochimps Platform is still a relatively high-touch service. Although the development and deployment experience is relatively simple, users can’t yet get up and running with only a credit card; they still need to engage with the company on setting up their applications. Part of the reason for this, Kelly explained, is users’ skill levels: data scientists might need to hone their coding skills, while developers might need to learn how to write better data flows.
“We’re walking the bridge over the valley [between a do-it-yourself platform and a wholly managed service],” Kelly said. When it crosses the chasm and allows for immediate credit-card access, he added, it will probably be centered on prepackaged solutions based on what’s popular among users. The first 90 percent will be written for users, and they’ll tune the last 10 percent to their specific needs.
Feature image courtesy of Shutterstock user voyager624.