
Summary:

LinkedIn’s new University Pages are a case study in how to build a big data application. Ideas are great and pretty web design is great, but you also need people who can find and format the data, and the systems in place to make everything work.


Professional network LinkedIn rolled out its new University Pages feature last week to much fanfare, but the pages are as much a matter of smart engineering as they are of smart business strategy. On Monday, LinkedIn Engineering published a blog post explaining the technology behind University Pages, and it underscores the importance of understanding your products, your data and the tools you need to process it.

Of course the new product started with an idea, but after that, blog post author Josh Clemm noted, the company’s data scientists spent years combing through member profiles, gathering and standardizing data about 23,000 colleges and universities. They built graph data models for each school, with the school as the primary node and things like related schools and LinkedIn-member alumni as secondary ones. That’s why you’re now able to visit any school’s LinkedIn profile (at least the ones that have been updated to the new format) and see the same information, such as where alumni work and who attended.
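To make the graph idea concrete, here is a minimal sketch of a school-centered graph, with the school as the primary node and typed edges pointing at secondary nodes. It is only an illustration: the class name `SchoolGraph`, the edge types and the sample members are all hypothetical, and nothing here reflects LinkedIn's actual schema or storage layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SchoolGraph:
    """Hypothetical school-centered graph: one primary node plus typed edges."""

    school: str
    # Maps an edge type (e.g. "alumnus") to the set of secondary nodes it reaches.
    edges: Dict[str, Set[str]] = field(default_factory=dict)

    def add_edge(self, edge_type: str, node: str) -> None:
        """Connect the school to a secondary node under the given edge type."""
        self.edges.setdefault(edge_type, set()).add(node)

    def neighbors(self, edge_type: str) -> List[str]:
        """Return the secondary nodes for one edge type, sorted for stable output."""
        return sorted(self.edges.get(edge_type, set()))


def build_example() -> SchoolGraph:
    """Build a tiny made-up graph; every name below is invented sample data."""
    graph = SchoolGraph("Example University")
    graph.add_edge("alumnus", "member:alice")
    graph.add_edge("alumnus", "member:bob")
    graph.add_edge("related_school", "Sample College")
    return graph
```

Because every school shares the same shape, a page can ask any graph the same questions, for instance `build_example().neighbors("alumnus")`, which is roughly why each updated school profile can show the same kinds of information.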

Notable alumnus: Ben Horowitz. The name rings a bell …

Under the covers, University Pages runs atop some serious big data technologies, many of which LinkedIn built itself. Those graphs are all stored in LinkedIn’s new flagship database technology, EspressoDB. Hadoop powered much of the work involved in getting all that data into a standard format. It’s also responsible for generating page information such as “similar schools” and “notable alumni”: those computations run periodically as batch jobs and then dump their results into LinkedIn’s Voldemort NoSQL database for fast access by web users (as well as into EspressoDB to help populate the schools’ graphs).
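The batch-then-serve pattern described here can be sketched in a few lines. This is a toy, not LinkedIn's pipeline: the blog post as summarized above does not say how "similar schools" is scored, so the Jaccard overlap of alumni employers used below is purely an assumption, a plain Python dict stands in for a key-value store such as Voldemort, and the function names and key format are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_schools_batch(
    employers_by_school: Dict[str, Set[str]], top_n: int = 2
) -> Dict[str, List[str]]:
    """Offline job: score every pair of schools once, keep the top_n per school."""
    scores: Dict[str, List[Tuple[float, str]]] = {s: [] for s in employers_by_school}
    for s1, s2 in combinations(sorted(employers_by_school), 2):
        score = jaccard(employers_by_school[s1], employers_by_school[s2])
        scores[s1].append((score, s2))
        scores[s2].append((score, s1))
    return {
        school: [
            name
            # Highest score first; ties are broken alphabetically.
            for _, name in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_n]
        ]
        for school, pairs in scores.items()
    }


def publish(results: Dict[str, List[str]], store: Dict[str, List[str]]) -> None:
    """Write precomputed results into the serving store (a dict stand-in here)."""
    for school, similar in results.items():
        store[f"similar_schools:{school}"] = similar
```

The design point is that the expensive pairwise comparison happens offline on a schedule, while a page view only needs a single key lookup such as `store["similar_schools:A"]`.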

The whole University Pages architecture, which Clemm explains in detail.

Two other open source technologies — one called Bobo and another called Zoie (which LinkedIn created) — power search for the new university profiles. LinkedIn’s Databus system streams updates into the search systems to ensure they always have the most up-to-date data.

We actually profiled LinkedIn’s data engineering team, its strategy and several of its key technologies, in a February feature. One of its leaders, Bhaskar Ghosh, will be part of our big data master panel at Structure: Europe next month.

Ghosh’s diagram of LinkedIn’s architecture, from that February feature.

But the main takeaway here isn’t that LinkedIn is great or that University Pages are great. In fact, as someone officially (I think) finished with university education, I could take them or leave them. The point is that there’s a right way to build “big data” applications, and web companies seem to understand this better than most. You see similar strategies in place at Netflix, Facebook, Google and elsewhere.

They’ve built virtuous circles where infrastructure systems, data scientists, web developers and product managers all enable each other to do their jobs better. If one piece is weak, everyone suffers. And most importantly, the product suffers.


  1. Great write up, and very informative info!

  2. Are there two Ben Horowitzes? If not, it’s “alumnus,” not “alumni.”

    Cheers,

    –rj

    1. “Notable alumni” is a list – notice the “”. Ben Horowitz is first on the list. The use of “alumni” here is fine.
