Summary:

Facebook wasn’t satisfied with how its legacy tools measured what was working and what wasn’t across the site’s many thousands of servers, so it created a distributed in-memory data store called Scuba to help. Here’s how it works and what other companies can take away from the project.

Facebook wasn’t satisfied with how its legacy tools measured what was working and what wasn’t across the site’s many thousands of servers, so it created a distributed in-memory data store called Scuba to help. The resulting tool samples information across Facebook’s infrastructure and can serve up hundreds of datasets containing up to hundreds of gigabytes of data and billions of samples, yet it takes less than a second to query that mass of information.

The goal is to get a rapid, real-time view of what’s happening in Facebook’s infrastructure and how that affects performance. The project started about a year ago as a discussion among Facebook engineers, writes Lior Abraham, an engineer on the Site Performance team, in a post on the Facebook Engineering page. Scuba ended up growing into a real product that more than 500 Facebookers across the company use to answer questions such as why site load time is increasing in Russia, or how fast photos are loading during peak hours. In an emailed interview Abraham writes:

We’re constantly sampling the state of various machines and systems and storing the samples in memory so we can query them really fast. The data is in raw form so that we can do pretty much any crazy query ad-hoc, with thousands of queries done every day.

So how does the tool work? In typical Facebook fashion, Abraham goes into a lot of detail. Essentially the tool samples data across a bunch of servers and stores it in RAM. Then people can query it using a high-level language such as PHP instead of a database query language. But the key tradeoff for speed and ease of use seems to be that the data doesn’t stick around forever: this is for real-time spot checks, not a historical overview.
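
Facebook hasn’t published Scuba’s code, so the following is only a minimal sketch of the idea as described: processes keep raw samples in memory, and an aggregator fans a query out across them and merges the results. All class and field names here are invented for illustration, and the sketch is in Python rather than whatever Facebook actually uses.

```python
import random
import time


class LeafNode:
    """Holds raw, unaggregated samples in memory, as the post describes."""

    def __init__(self):
        self.samples = []  # list of dicts, stored raw so any ad-hoc query is possible

    def record(self, sample: dict):
        sample["ts"] = time.time()
        self.samples.append(sample)

    def query(self, predicate, metric):
        """Scan the in-memory samples and return matching metric values."""
        return [s[metric] for s in self.samples if predicate(s)]


class Aggregator:
    """Fans a query out to every leaf and merges the partial results."""

    def __init__(self, leaves):
        self.leaves = leaves

    def avg(self, predicate, metric):
        values = [v for leaf in self.leaves for v in leaf.query(predicate, metric)]
        return sum(values) / len(values) if values else None


# Hypothetical usage: "how fast are photos loading in Russia right now?"
leaves = [LeafNode() for _ in range(4)]
for leaf in leaves:
    for _ in range(1000):
        leaf.record({
            "country": random.choice(["RU", "US", "BR"]),
            "endpoint": random.choice(["photo", "feed"]),
            "load_ms": random.gauss(300, 80),  # fake load times for illustration
        })

agg = Aggregator(leaves)
print(agg.avg(lambda s: s["country"] == "RU" and s["endpoint"] == "photo", "load_ms"))
```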

Abraham said via email:

Since this is mostly designed for sampled systems and performance data, and since we store samples raw and in RAM for speed, we typically only keep around 30 days of data. Some teams may keep data for longer, …. On the infrastructure team we’re mostly concerned with what’s going on *now* or in the very recent past.
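
The post doesn’t say how that expiry actually works inside Scuba, but as a rough, hypothetical sketch (not Facebook’s code), the roughly 30-day retention Abraham describes could be as simple as pruning anything past the window whenever new samples arrive:

```python
import collections
import time

RETENTION_SECONDS = 30 * 24 * 3600  # roughly the 30-day window Abraham mentions


class SampleBuffer:
    """In-memory buffer of raw samples; anything older than the window is dropped."""

    def __init__(self):
        self.samples = collections.deque()  # oldest samples sit at the left end

    def record(self, sample: dict, now=None):
        sample["ts"] = now if now is not None else time.time()
        self.samples.append(sample)
        self._expire(sample["ts"])

    def _expire(self, now):
        cutoff = now - RETENTION_SECONDS
        while self.samples and self.samples[0]["ts"] < cutoff:
            self.samples.popleft()  # let old data go rather than archiving it
```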

The engineers working on the project also built visualization tools to help people understand the results. And for those wanting to move beyond the provided visualizations, the Facebook team made it easy to construct what it calls “goggles,” a way to create a custom visualization.
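
The goggles API itself isn’t described in the post, so this is purely a hypothetical example of the kind of custom view an engineer might assemble on top of query results; the numbers below are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical query output: average photo load time per hour, the kind of
# "how fast are photos loading during peak hours?" question mentioned above.
hours = list(range(24))
avg_load_ms = [280 + 60 * (1 if 18 <= h <= 22 else 0) for h in hours]  # fake data

plt.plot(hours, avg_load_ms, marker="o")
plt.xlabel("Hour of day (UTC)")
plt.ylabel("Avg photo load time (ms)")
plt.title("Hypothetical custom view over Scuba-style query results")
plt.show()
```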

While this isn’t something Facebook is planning to open source (update: Facebook emailed to tell me it doesn’t have plans to open source it at this time), I like the attitude toward data these guys seem to have. First, they saw a need and played around trying to create something better than the legacy tools they were using. They didn’t set out to create the perfect tool, which I think is a trap a lot of people trying to build data tools can fall into. Second, they are willing to grab a lot of data and then let it go, which is cool to see, especially since the mantra today seems to be grab data, keep it and hope to use it one day. That’s not necessarily bad, but it can and will get expensive.

And finally, they managed to build a tool that helps them know what they don’t know about how Facebook’s infrastructure affects the site, and made that information easy for a wide variety of people to access. For Facebook, where the infrastructure is a key asset, that knowledge is strategic. Widespread access to data that every employee can understand has a lot of promise, and Scuba should help Facebook realize that. Other businesses should take note.
