LinkedIn details Pinot, its new analytics system that powers its data applications

Credit: Jonathan Vanian

LinkedIn has developed a new real-time data analytics system dubbed Pinot, which the social network uses to build various data-intensive features across its huge infrastructure. The new system underpins all of LinkedIn’s major data applications.

With Pinot at the core of how [company]LinkedIn[/company] processes data, the company can build new features on top of it for its members, for the companies in its ad network, or for whatever its product team dreams up with its enormous trove of user data. LinkedIn’s various data-tracking features, such as showing who has recently viewed a user’s profile and keeping tabs on which people follow particular companies, are all fueled by Pinot.

When LinkedIn was a smaller startup, its engineering team was split into many different groups, each using a host of different data and storage systems, such as the Oracle relational database management system (RDBMS) for querying and Voldemort for key-value storage, explained Praveen Neppalli Naga, an engineer and manager at LinkedIn and author of the blog post detailing Pinot. As LinkedIn grew in popularity, so did its collection of user data, and all those different systems made it difficult for LinkedIn to scale.

It was then that Alex Vauthey, LinkedIn’s vice president of engineering, tasked Naga and a small engineering team with building a new centralized system that could consolidate all of LinkedIn’s data and make it easier for the company to build new data-intensive products on top of it.

“There was not one proper solution that would be leverageable across the company,” said Naga.

Praveen Neppalli Naga

To centralize LinkedIn’s data, Naga and his team decided to use a Hadoop infrastructure model as the base of Pinot, with homegrown modifications to meet their analytics requirements. With Hadoop acting as LinkedIn’s data warehouse, the team wrote Hadoop scripts to retrieve the user data—everything from what college a person attended to what job skills they have listed—which is then indexed.
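The offline indexing step the post describes can be sketched in miniature: take a batch of warehouse records and build an inverted index for each dimension. This is a toy illustration only; the article does not publish Pinot’s actual segment format, and all names here are hypothetical.

```python
from collections import defaultdict

def build_segment(records, dimensions):
    """Build a simple inverted index per dimension from a batch of records.

    A toy stand-in for the kind of offline indexing job described in the
    article; Pinot's real on-disk format is not detailed there.
    """
    index = {dim: defaultdict(set) for dim in dimensions}
    for row_id, record in enumerate(records):
        for dim in dimensions:
            value = record.get(dim)
            # Multi-value dimensions (e.g. skills) map each value to the row.
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[dim][v].add(row_id)
    return index

# Two illustrative member records, with a single-value and a
# multi-value dimension.
records = [
    {"college": "Stanford", "skills": ["java", "hadoop"]},
    {"college": "MIT", "skills": ["scala"]},
]
segment = build_segment(records, ["college", "skills"])
```

With this layout, answering “which members list Hadoop as a skill?” becomes a single dictionary lookup (`segment["skills"]["hadoop"]`) rather than a scan over every record.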

From the LinkedIn blog post:
[blockquote person="LinkedIn" attribution="LinkedIn"]LinkedIn data has a lot of depth and each dimension requires special treatment. We needed to build custom compression techniques to fit every dimension, in order to get optimal scan speed tradeoff vs. memory consumed. For example, each one of our members can have hundreds of skills and representing them per event is difficult. Similarly, groups that members belong to and companies they follow are some of the dimensions difficult to represent per event. We built Pinot with this difficult to index data in mind, but will save the details of the compression techniques for future posts.[/blockquote]

Pinot needed to support multiple types of data indexing, because LinkedIn’s data has many dimensions that must each be indexed in a specific way. For instance, the college a person attended is a data point that will never change, so it should be indexed differently from the skills a person lists, which are subject to change.
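One common way to handle a high-cardinality, multi-value dimension like skills is dictionary encoding: store each distinct value once and replace per-row values with small integer ids. This is a generic sketch of that standard technique, not the compression scheme LinkedIn actually built, which the post explicitly leaves for future write-ups.

```python
def dictionary_encode(column):
    """Dictionary-encode a multi-value column.

    Each distinct string is stored once in a dictionary, and every row
    holds only small integer ids. This trades a little indirection on
    scans for a large memory saving when values repeat often, the kind
    of scan-speed vs. memory tradeoff the blockquote above mentions.
    """
    dictionary = {}  # value -> integer id
    encoded = []     # per-row lists of ids
    for values in column:
        row = []
        for v in values:
            if v not in dictionary:
                dictionary[v] = len(dictionary)
            row.append(dictionary[v])
        encoded.append(row)
    return dictionary, encoded

# A multi-value "skills" column for three hypothetical members.
skills = [["java", "hadoop"], ["java", "scala"], ["hadoop"]]
dictionary, encoded = dictionary_encode(skills)
# encoded: [[0, 1], [0, 2], [1]]
```

The repeated strings "java" and "hadoop" are stored only once; with hundreds of skills per member, the savings compound quickly.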

LinkedIn Pinot System

To make sure that any request or query a user sends to LinkedIn, such as seeing who recently viewed one’s profile, returns quickly and accurately, LinkedIn’s engineering team also had to ensure that the newest data relevant to the request is readily available, while older data remains accessible without slowing the query by getting mixed into the fresher data. If a user wants to retrieve data from a specific time, that older data is still available, but it’s not stored in the same manner as newer data, which needs to be accessed more often; LinkedIn uses Apache Kafka to help with this real-time data indexing process.
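The separation of fresh and older data can be pictured as time-partitioned segments: recent events land in real-time segments fed from Kafka, older data sits in offline segments, and a query touches only the segments whose time range it overlaps. The segment names and layout below are illustrative assumptions, not details from the article.

```python
def route_query(segments, start, end):
    """Select only the segments whose time range overlaps [start, end].

    A sketch of time-partitioned query routing: old segments stay
    accessible but are simply skipped when a query only needs fresh data,
    so they never slow it down.
    """
    return [s for s in segments
            if s["min_time"] <= end and s["max_time"] >= start]

# Hypothetical segment catalog: two offline segments of historical data
# plus one real-time segment holding the newest events.
segments = [
    {"name": "offline_2014_10", "min_time": 100, "max_time": 199},
    {"name": "offline_2014_11", "min_time": 200, "max_time": 299},
    {"name": "realtime_0",      "min_time": 300, "max_time": 310},
]
hit = route_query(segments, 250, 305)
# selects offline_2014_11 and realtime_0; offline_2014_10 is skipped
```

A query over only the freshest window would touch just the real-time segment, which is why mixing old data into it would hurt latency.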

Ultimately, it took LinkedIn roughly two years to develop Pinot, and the company tested the first version in December 2012, said Naga. Pinot is now the “de-facto analytics platform for LinkedIn,” he added.

“LinkedIn is all about data in the end of the day,” said Naga. “We would love to open source this infrastructure to build a community around it.”

4 Comments

Peter Fretty

Love seeing these successes. It perfectly correlates with the results of a recent SAS survey where the top data objectives were identified as operational improvements, customer experience enhancement and creation of new products.

Flavio Graf

@praveenTweets it is clear that for a highly interconnected data-set like the one Linkedin has, the proper architecture should be based on a graph database rather than on columnar indexes.

colinmutter

I agree – I’m curious why the use of other distributed graph technologies like Titan or Cayley wouldn’t work. I’d also be curious to know what the Kafka pipeline ultimately does with the data, as it sounds like a pretty interesting way to solve for this data indexing challenge.

Comments are closed.