If you know Java, R or SAS, doing machine learning on Hadoop data just got a lot easier. Concurrent (see disclosure), the company behind the popular Cascading framework for writing big data jobs, has developed a new open source tool called Pattern that lets users export their models from statistical analysis applications and run them at scale on Hadoop data with little to no code changes.
The reason for creating Pattern is pretty simple, according to Concurrent founder and CTO Chris Wensel: “Hadoop is never used alone.” It’s always part of a data environment that also includes databases, visualization tools, analytics software and/or statistical analysis tools that arguably do the really valuable work. Hadoop’s real value is as an integration platform that can feed data into these other systems and, ideally, put their outputs to work across much larger datasets.
Developers can use the Pattern Java API to create machine learning jobs, but they can also simply export a Predictive Model Markup Language (PMML) file from software like R, SAS and MicroStrategy, which Pattern will read and run as a Cascading workflow. Models are useless unless you can run them in production, Wensel said, and Pattern lets them run across far more data, stored in Hadoop, than you could use to build them in those other tools.
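To make the PMML hand-off a little more concrete, here is a minimal Python sketch of the general idea: a model exported as PMML is just an XML description that another program can read and apply to new records. This is not Pattern's API or implementation (Pattern runs models as Cascading workflows on Hadoop). The embedded model, its field names (x1, x2), its coefficients and the helper names load_linear_model and score are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace for PMML 4.1 documents (the sample below declares this version).
PMML_NS = {"p": "http://www.dmg.org/PMML-4_1"}

# A tiny, made-up linear regression model in PMML form. A real export from
# R or SAS would carry more metadata; this keeps only what the sketch needs.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="demo">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Hypothetical helper: pull the intercept and coefficients out of the PMML."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", PMML_NS)
    }
    return intercept, coefficients


def score(model, record):
    """Hypothetical helper: apply the linear model to one record (a dict)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(SAMPLE_PMML)

# Stand-in for rows that would, in practice, live in Hadoop.
records = [{"x1": 1.0, "x2": 2.0}, {"x1": 3.0, "x2": 0.0}]
predictions = [score(model, r) for r in records]
print(predictions)
```

The point of the sketch is the separation of concerns: the model is built in one tool and described as data, while scoring happens wherever the records are. A system in Pattern's position would do the scoring step in parallel across the cluster rather than in a local loop.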
However, Wensel noted, “The real takeaway isn’t Pattern itself.”
From his perspective, the real story is Pattern plus Cascading plus Lingual, the open source SQL-to-Hadoop tool that Concurrent recently developed and released. Lingual is the tie that binds everything together, creating a sort of assembly line for data as it moves from generation to delivering value. For example, someone might create a Cascading job that adds structure to incoming data, and then pull some of the data into R using Lingual. Once a model is created in R and exported to the Hadoop cluster using Pattern, Lingual can feed the MapReduce output file back to R so a data scientist can test the model’s accuracy.
And actually, Wensel said, Lingual could have a positive effect on companies’ bottom lines. Airbnb recently replaced a departed engineer with Lingual for monthly migrations of data from Hadoop into SQL environments. Climate Corporation, a massive Hadoop and Cascading user, could use Lingual to let its crop-and-weather insurance customers access their data from the company’s Hadoop store.
Lingual and Pattern should help Concurrent finally make some money, too. Both of them, as well as the Cascading framework that underpins them, will always be open source, Wensel said, but the company plans to create “a suite of products that will make your life much better if … you standardize on Cascading.”
For example, the company has the ability to monitor jobs at the application level rather than the cluster level, meaning it can tell you the details of that job that’s locking up all the resources and whether you really want to kill it (it might be an important report for the CFO …). “We can do some really interesting things,” Wensel said.
Disclosure: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.
This post was updated at 2:48pm PT to correct Chris Wensel’s title. He is CTO.