Pinterest’s engineering team has created a self-serving platform — comprised of home grown, open source and commercial tools — that hooks in with Hadoop to orchestrate and process all of the company’s enormous amounts of data, consisting of 30 billion pins. The company shared a few details behind the project for the first time on Thursday.
The heart of Pinterest is the ability of its users to create and share personalized boards that display pins — essentially, visual bookmarks to various places on the web — and so the company needed to find a way to obtain insights into all that user activity and provide a more enjoyable experience through related pins, search indexing and image processing. These are all heavy-duty tasks that fall beyond the scope of Hadoop by itself.
Currently, the social scrapbook and visual discovery site has roughly 10 petabytes of data in Amazon(S AMZN) S3 and processes about a petabyte of data daily using Hadoop, according to a blog post from Pinterest data engineer Mohammad Shahangian published Thursday. Although Hadoop is great at storing and processing that data, Shahangian wrote, it’s not an easy system to use by itself and it requires some customization with the help of other services to really take advantage of its capabilities.
To help Hadoop process data faster, the Pinterest team uses MapReduce as a way to separate compute and storage from each other. With MapReduce, Pinterest’s Hadoop clusters can be synced together with S3 so that they can process all of the stored data in the Amazon cloud. This is helpful in that if one of the clusters were to go down or was in need of a hard reset, no work would be lost; all the data would remain in S3 as opposed to having everything be held up in the Hadoop clusters.
The Pinterest team also decided on using Hive in conjunction with Hadoop and found that Hive’s ability to store metadata was helpful in that it could catalog all the data that comes from each Hadoop job. Hive also allows the team to take advantage of some of the functionality of common SQL tools, like the ability to list tables and detail the information within them; this is supposedly a much more simple interface to work.
From the blog post:
[blockquote person=”Pinterest” attribution=”Pinterest”]We orchestrate all our jobs (whether Hive, Cascading, HadoopStreaming or otherwise) in such a way that they keep the HiveMetastore consistent with what data exists on disk. This makes is possible to update data on disk across multiple clusters and workflows without having to worry about any consumer getting partial data.[/blockquote]
Pinterest also uses Puppet, the open source configuration management tool produced by Puppet Labs (see disclosure), as a way to keep track of its customized system.
From the blog post:
[blockquote person=”Pinterest” attribution=”Pinterest”]Puppet had one major limitation for our use case: when we add new nodes to our production systems, they simultaneously contact the Puppet master to pull down new configurations and often overwhelm the master node, causing several failure scenarios. To get around this single point of failure, we made Puppet clients “masterless,” by allowing them to pull their configuration from S3 and set up a service that’s responsible for keeping S3 configurations in sync with the Puppet master. [/blockquote]
Another thing of interest is the fact that Pinterest decided to go with the Hadoop-as-a-service startup Qubole to take care of its Hadoop jobs, because Amazon’s Elastic MapReduce (EMR) had trouble performing when the company grew out to over a couple hundred nodes. As Shahangian details, Qubole, whose founders are two former Facebook engineers responsible for helping create Hive, had no problem scaling out horizontally to thousands of nodes on only one cluster and its efficiency led to a 30-to-60 percent increase in throughput compared with when the company was using EMR.
Eventually, Pinterest plans to experiment with Hadoop 2, which can help manage resources in a cluster all by itself (unlike Hadoop), but so far the company seems satisfied enough with its souped-up version of Hadoop powered by services like Qubole and Puppet.
Disclosure: Puppet Labs is backed by True Ventures, a venture capital firm that is an investor in Gigaom.
Post and thumbnail images courtesy of Shutterstock user Roxanne Ready.