Look under the covers of almost any data-focused web application — including Klout — and you’ll find Hadoop. The open-source big data platform is ideal for storing and processing the large amounts of information needed for Klout to accurately measure and score its users’ social media influence. But Klout also has another important, and very not-open-source, weapon in its arsenal — Microsoft SQL Server.
Considering the affinity most web companies have for open source software, the heavy use of Microsoft technology within Klout is a bit surprising. The rest of the Klout stack reads like a who’s who of hot open source technologies — Hadoop, Hive, HBase, MongoDB, MySQL, Node.js. Even on the administrative side, where open source isn’t always an option, Klout uses the newer and less-expensive Google Apps instead of Microsoft Exchange.
“I would rather go open source, that’s my first choice always,” Klout VP of Engineering Dave Mariani told me during a recent phone call. “But when it comes to open source, scalable analysis tools, they just don’t exist yet.”
Klout’s stack is telling of the state of big data analytics, where tools for analyzing data processed by and stored in Hadoop can be hard to use effectively, and where Microsoft has actually been turning a lot of heads lately.
How Klout does big data
“Data is the chief asset that drives our services,” said Mariani, and being able to understand what that data means is critical. Hadoop alone might be fine if the company were just interested in analyzing and scoring users’ social media activity, but it actually has to satisfy a customer set that includes users, platform partners (e.g., The Palms hotel in Las Vegas, which ties into the Klout API and uses scores to decide whether to upgrade guests’ rooms) and brand partners (the ones who target influencers with Klout Perks).
As it stands today, Hadoop stores all the raw data Klout collects — about a billion signals a day on users, Mariani said — and also stores it in an Apache Hive data warehouse. When Klout’s analysts need to query the data set, they use the Analysis Services tools within SQL Server. But because SQL Server can’t yet talk directly to Hive (or Hadoop, generally), Klout has married Hive to MySQL, which serves as the middleman between the two platforms. Klout loads about 600 million rows a day into SQL Server from Hive.
It’s possible to use Hive for querying data in a SQL-like manner (that’s what it was designed for), but Mariani said it can be slow, difficult and not super-flexible. With SQL Servers, he said, queries across the entire data set usually take less than 10 seconds, and the product helps Klout figure out if its algorithms are putting the right offers in front of the right users and whether those campaigns are having the desired effects.
Analysis Services also functions as a sort of monitoring system, Mariani explained. It lets Klout keep a close eye on moving averages of scores so it can help spot potential problems with its algorithms or problems in the data-collection process that affect users’ scores.
Why Microsoft is turning heads in the Hadoop world
Although Klout is using SQL Server in part because Mariani brought it along with him from Yahoo and Blue Lithium before that, Microsoft’s recent commitment to Hadoop has only helped ensure its continued existence at Klout. Since October, Microsoft has been working with Yahoo spinoff Hortonworks on building distributions of Hadoop for both Windows Server and Windows Azure. It’s also working on connectors for Excel and SQL Server that will help business users access Hadoop data straight from their favorite tools.
Mariani said Klout is working with Microsoft on the SQL Server connector, as his team is anxious to eliminate the extra MySQL hop it currently must take between the two environments.
Microsoft’s work on Hadoop on Windows Azure is actually moving at an impressive pace. The company opened a preview of the service to 400 developers in December and on Tuesday, coinciding with the release of SQL Server 2012, opened it up to 2,000 developers. According to Doug Leland, GM of product management for SQL Server, Hadoop on Windows Azure is expanding its features, too, adding support for the Apache Mahout machine-learning libraries and new failover and disaster recovery capabilities for the problematic NameNode within the Hadoop Distributed File System.
Leland said Microsoft is trying “to provide a service that is very easy to consume for customers of any size,” which means an intuitive interface and methods for analyzing data. Already, he said, Webtrends and the University of Dundee are among the early testers of Hadoop on Windows Azure, with the latter using it for genome analysis.
We’re just three weeks out from Structure: Data in New York, and it looks as if our panel on the future of Hadoop will have a lot to talk about, as will every at the show. As more big companies get involved with Hadoop and the technology gets more accessible, it opens up new possibilities who can leverage big data and how, as well as for an entirely new class of applications that use Hadoop like their predecessors used relational databases.