LinkedIn has developed a new real-time data analytics system dubbed Pinot, which the social network uses to build data-intensive features across its huge infrastructure. The new system underpins all of LinkedIn’s major data applications.
With Pinot at the core of how [company]LinkedIn[/company] processes data, the company can craft new features on top of it, whether for its members, for the companies in its ad network, or for whatever its product team dreams up with LinkedIn’s enormous trove of user data. The site’s various data-tracking features, such as showing who has recently viewed a user’s profile and keeping tabs on which people follow particular companies, are all fueled by Pinot.
When LinkedIn was a smaller startup, its engineering team was split into many different groups, each using a host of contrasting data and storage systems, such as the Oracle relational database management system (RDBMS) for querying and Voldemort for key-value storage, explained Praveen Neppalli Naga, an engineer and manager at LinkedIn and author of the blog post detailing Pinot. As LinkedIn grew in popularity, so did its collection of user data, and all those different systems made it difficult for the company to scale.
It was then that Alex Vauthey, LinkedIn’s vice president of engineering, tasked Naga and a small engineering team with building a new centralized system that could consolidate all of LinkedIn’s data and make it easier for the company to build new data-intensive products on top of it.
“There was not one proper solution that would be leverageable across the company,” said Naga.
To centralize LinkedIn’s data, Naga and his team decided to use a Hadoop infrastructure model as the base of Pinot, with home-built modifications to achieve their analytics goals. With Hadoop acting as LinkedIn’s data warehouse, the team wrote Hadoop scripts to retrieve the user data, everything from the college a person attended to the job skills they have listed, which is then indexed.
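To get a feel for what "indexing" the retrieved user data means here, consider a minimal sketch of an inverted index over member attributes. This is an illustration only, not LinkedIn's actual pipeline; the member IDs and attribute names are hypothetical.

```python
# Hypothetical sketch: after batch jobs pull member attributes out of
# the data warehouse, each dimension is indexed so analytics queries
# can find matching members without scanning every record.
from collections import defaultdict

def build_inverted_index(members):
    """Map each (dimension, value) pair to the set of member IDs that carry it."""
    index = defaultdict(set)
    for member_id, attrs in members.items():
        for dimension, values in attrs.items():
            for value in values:
                index[(dimension, value)].add(member_id)
    return index

# Toy data: two members with a college and a list of skills.
members = {
    101: {"college": ["Stanford"], "skills": ["java", "sql"]},
    102: {"college": ["MIT"], "skills": ["java"]},
}
index = build_inverted_index(members)
print(sorted(index[("skills", "java")]))  # -> [101, 102]
```

A real system would build such indexes per column and per data segment, but the lookup idea, value to matching rows, is the same.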
From the LinkedIn blog post:
[blockquote person="LinkedIn" attribution="LinkedIn"]LinkedIn data has a lot of depth and each dimension requires special treatment. We needed to build custom compression techniques to fit every dimension, in order to get optimal scan speed tradeoff vs. memory consumed. For example, each one of our members can have hundreds of skills and representing them per event is difficult. Similarly, groups that members belong to and companies they follow are some of the dimensions difficult to represent per event. We built Pinot with this difficult to index data in mind, but will save the details of the compression techniques for future posts.[/blockquote]
Pinot needed to support multiple types of data indexing, because LinkedIn’s data includes many dimensions that must each be indexed in a specific way. For instance, the college a person attended is a data point that will never change, so it should be indexed differently from a person’s skills, which are subject to change.
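One common way to treat a stable, single-valued dimension like college is dictionary encoding, which the LinkedIn post's mention of custom compression hints at. The sketch below is a hypothetical illustration, not a description of Pinot's internals: repeated string values are replaced with small integer codes, producing a compact column that is cheap to scan.

```python
# Hypothetical illustration: a single-valued, rarely-changing dimension
# (e.g. college) compresses well with dictionary encoding, while a
# multi-valued, frequently-changing dimension (e.g. skills) needs a
# variable-length representation per row.
def dictionary_encode(values):
    """Replace repeated string values with small integer codes."""
    dictionary = {}   # value -> integer code
    encoded = []      # the compact integer column
    for v in values:
        code = dictionary.setdefault(v, len(dictionary))
        encoded.append(code)
    return dictionary, encoded

colleges = ["MIT", "Stanford", "MIT", "MIT"]
dictionary, column = dictionary_encode(colleges)
# dictionary == {"MIT": 0, "Stanford": 1}; column == [0, 1, 0, 0]
```

Because the column is now just small integers, it can be scanned quickly and stored in far less memory than the original strings, which is the scan-speed-versus-memory tradeoff the blog post describes.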
Whenever a user sends a request or query to LinkedIn, such as seeing who recently viewed their profile, the response needs to be fast and accurate. LinkedIn’s engineering team therefore had to figure out a way to keep the newest data pertaining to the request readily available while the older data remains accessible, without hindering the user’s query by getting mixed into the fresher data. If a user wants to retrieve data from a specific time, the older data is still available, but it is not stored in the same manner as newer data, which needs to be accessed more often. LinkedIn uses Apache Kafka to help with this real-time data indexing process.
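The fresh-versus-historical split described above can be sketched with a toy segmented store. This is an assumption-laden illustration of the general pattern, not Pinot's actual design: new events land in a small mutable segment, full segments are sealed as read-only historical data, and a time-bounded query skips any sealed segment that is too old to matter.

```python
class SegmentedStore:
    """Toy sketch: fresh events live in a small mutable real-time
    segment; once it fills up it is sealed into a read-only historical
    segment, so queries for recent data never scan the full history."""

    def __init__(self, segment_size=1000):
        self.segment_size = segment_size
        self.realtime = []    # newest events, still mutable
        self.historical = []  # sealed, read-only segments (oldest first)

    def append(self, event):
        self.realtime.append(event)
        if len(self.realtime) >= self.segment_size:
            self.historical.append(tuple(self.realtime))  # seal the segment
            self.realtime = []

    def query_since(self, ts):
        # Scan the small real-time segment first, then only the
        # historical segments whose newest event is recent enough.
        hits = [e for e in self.realtime if e["ts"] >= ts]
        for segment in self.historical:
            if segment[-1]["ts"] >= ts:
                hits.extend(e for e in segment if e["ts"] >= ts)
        return hits

store = SegmentedStore(segment_size=2)
for t in range(1, 6):
    store.append({"ts": t})
recent = store.query_since(3)  # touches only the segments that can match
```

In a real deployment the mutable segment would be fed by a stream consumer (the article says LinkedIn uses Apache Kafka for this), but the division of labor, a small hot segment plus sealed cold segments, is the same.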
Ultimately, it took LinkedIn roughly two years to develop Pinot, and the company tested the first version of it in December 2012, said Naga. Pinot is now the “de-facto analytics platform for LinkedIn,” he added.
“LinkedIn is all about data in the end of the day,” said Naga. “We would love to open source this infrastructure to build a community around it.”