Hortonworks has heard the pleas from its customers, and it’s beginning to integrate the Storm stream-processing engine with the company’s Hadoop distribution. This is a pretty big deal considering Hadoop’s legacy as a relatively slow platform designed for batch processing, but Storm support is just another brick in the wall between Hadoop and its past.
Storm is a stream-processing engine created by a company called Backtype several years ago, largely with the goal of supplementing Hadoop with something that could handle real-time processing of streaming data — stuff like sensor data, clicks and status updates. Twitter bought Backtype in 2011 and has been leading Storm development since. However, the open source project has proven mighty popular among the web set, which has already found some innovative ways of running Storm in conjunction with, or on top of, Hadoop.
But Hortonworks customers had seen what Storm was capable of at places like Twitter and Yahoo, Hortonworks VP of Marketing Dave McJannet told me on Monday, and they wanted to turn it on their streaming data, too. They want to do things like advanced geofencing, real-time analysis of web activity or perhaps analysis of data from medical sensors as it’s generated.
“Stream processing is clearly one of the [desires] we’ve seen come up most often [among] some of our early adopter customers,” he said. “…[It’s] flashing as an interest that more and more of our enterprise users focus on.”
When Storm recently became an Apache Software Foundation Incubator project, that was enough for Hortonworks to invest in it, said Bob Page, VP of products. The company is always looking to incorporate open source technologies that are related to Hadoop into the Hortonworks Data Platform, but it has to do so with its customers in mind. By Hortonworks’ way of thinking, if it starts integrating technologies without strong community support or technologies too far removed from the Hadoop trunk code, he explained, “We’re putting them at risk.”
Once it decided to invest in Storm, though, the question was how to actually do it. “The challenge is how you make [it] enterprise-consumable and appropriate for the mainstream,” McJannet said.
Hortonworks’ plan is to finish the base-level integration before the year’s end, and then to complete two more phases of enterprise feature additions in the near future — probably within a year’s time, Page confirmed. But, he added, “We don’t want to wait to have all these ready before people can start banging on it.”
However, as much as Storm is a near polar opposite of Hadoop MapReduce in terms of when and how it processes data, it’s just the latest processing framework that Hadoop can technically support thanks to a new cluster-management layer called YARN. Hortonworks is working to speed up Hive via a new processing framework called Tez. YARN also lets Hadoop users run the in-memory Spark processing framework, and Microsoft is using YARN to make Hadoop more suitable for machine learning workloads.
Additionally, YARN lets all these different technologies — as well as HBase, Giraph or anything else — run on the same cluster of physical machines. Mesos, another cluster-management technology (developed at the University of California, Berkeley, and now an Apache project), accomplishes a similar goal, although it’s not tied to the Hadoop Distributed File System like YARN is.
All of this activity taken together is much more than the sum of its parts. It’s a pretty clear sign that big data isn’t just a flash in the pan, and that Hadoop will very likely be the dominant platform for running big data applications — whatever shape they happen to take.