Sometimes it’s hard for data scientists and big data sets to find each other. That’s the problem that EMC’s Greenplum division and Kaggle are taking on with a new partnership. Kaggle is a predictive modeling platform that sponsors competitions in which data scientists compete to solve big data problems.
Under the new alliance, Kaggle’s community of big data eggheads can use Greenplum’s Chorus big data application to solve real-world problems.
“We’ve had good adoption of Chorus and companies’ internal data workers are using it to do data science so they now have the tools, but honestly they don’t have all the people they need,” said Josh Klahr, VP of product management for Greenplum. “Now you can search the Kaggle community based on rank, expertise, location and invite them to work on your challenge using Greenplum Chorus.”
Playing Yenta to big data players
Kaggle ranks its participants much the way the USTA ranks tennis players. And that community is growing fast: when Kaggle started fundraising in August, it had 11,000 members; now it has close to 60,000, said Anthony Goldbloom, CEO of Kaggle. He added that this is the first such vendor partnership Kaggle has done. (GigaOM has worked with Kaggle and Splunk on the GigaOM WordPress Challenge: Splunk Innovation Prospect.)
Also on the partner ecosystem front, Greenplum inked a deal that gives Chorus users access to Gnip’s historical Twitter feeds and will let them import Twitter streams into their Chorus sandbox for analysis. And finally, Chorus is partnering with Tableau, the popular analytics tool, so that users can provision Tableau workbooks from their Chorus data sources.
Big data is one area where a broad ecosystem of data providers is essential. Putting good data scientists together with great data sets is incredibly important, said Ben Woo, managing director of research firm Neuralytics. “Big data is awfully short on the kinds of people who’ve done this work before and, frankly, people who give a damn. This sort of matchmaking is valuable.”
Goal: Melding public and private data to spark new insights
Combining publicly available “sentiment” data from sources like Twitter with internal business data lets data scientists ask interesting questions or find interesting questions to ask. For example, a pharmaceutical company has lots of its own data on a new drug. What it may not have is the sort of information about unforeseen side effects that might surface on Twitter or blogs after the drug is released. “If you can match Twitter feeds and patient forums, you can find out unexpected things: see that maybe people are switching from your drug to another. Analyzing that discussion can be incredibly important,” Goldbloom said.
In related news, Greenplum, as promised last spring, is open sourcing Chorus as the OpenChorus Project under the Apache 2.0 license.