
How Facebook Squeezes More From Its Machines

Back in June, during an onstage conversation with me at our Structure 09 conference, Facebook VP of Technical Operations Jonathan Heiliger lamented how chip makers such as Intel and AMD don’t quite understand the needs of web behemoths like his company, instead touting benchmarks and metrics that are far removed from reality.

Industry-standard benchmarks, such as those published by The Standard Performance Evaluation Corporation (SPEC), can be reasonable indicators of maximum throughput for certain workloads. At Facebook, we recognized these benchmarks wouldn’t necessarily represent our application behavior under real-world conditions and developed a proprietary analysis methodology.

Frustration with those specifications is what led to the building of Facebook’s capacity testing tool, Dyno, which the company has been using since July. Yesterday, Jonathan and two of his colleagues, Marco Baray and Jason Taylor, shared some details as to how, exactly, Facebook benchmarks server performance, and how that has helped them squeeze the most out of their machines. The findings are shared in a white paper entitled “Real-World Web Application Benchmarking” (embedded at the end of this post). Taylor spearheaded the project for Facebook.

“We wanted to get rid of the ad-hoc nature of server performance measurement and deployment,” said Heiliger of Dyno. The tool is named after the dynamometer, a device that measures force or power, typically in automobiles. “Effectively we are doing the same for the servers where we are focused on throughput and server capacity,” said Heiliger. “When you get to a certain scale, say, 100 servers, you need to have something like Dyno.” It also allows the company to constantly optimize its software stack to derive the most out of its hardware. “It allows us to more effectively measure the performance of our server infrastructure and then derive the most out of it,” said Heiliger. From the white paper:

Anecdotally, when Facebook switched from an FB-DIMM platform to the Intel San Clemente platform, utilizing DDR2 memory, we observed an unexpected increase in throughput.  This performance boost initiated an investigation that found the web application to be memory- and CPU-bound.  The decreased latency of the DDR2 architecture provided a significant increase in web node throughput.

Baray explained that as Facebook adds more features to its service, it becomes more complex. “The web site becomes heavier, so we need to constantly adapt our capacity and figure out how we manage it smartly,” he said. In order to do that, the company needs to constantly monitor its data as effectively as possible. And that’s where Dyno comes in handy.

For example, the company recently added new servers that were based on Intel’s Nehalem/Tylersburg chip architecture, which delivered markedly superior performance over its existing Harpertown-based servers. “There was an over 40 percent difference, which is huge when you have thousands of servers,” said Heiliger.

Knowing which servers can handle more load and deliver more throughput allows Facebook to dynamically shift traffic around in order to achieve top performance. Facebook has more than 30,000 servers, according to some estimates, and adds roughly 10,000 new ones every 18 months. The company can’t afford not to squeeze the most out of its machines.
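To make the idea of shifting traffic by measured capacity concrete, here is a minimal sketch in Python. It is not Facebook’s method and has nothing to do with Dyno’s actual implementation, which the company has not published in code form. The function name `split_traffic`, the pool sizes and the request rate are all hypothetical, and the 1.4 weight simply assumes the “over 40 percent” figure quoted above translates directly into per-server throughput.

```python
def split_traffic(total_requests, pools):
    """Divide a request rate across server pools in proportion to capacity.

    pools maps a pool name to (server_count, relative_throughput_per_server).
    All names and numbers here are illustrative assumptions, not Facebook data.
    """
    # Capacity of a pool = number of servers * measured per-server throughput.
    capacity = {name: count * per_server
                for name, (count, per_server) in pools.items()}
    total_capacity = sum(capacity.values())
    if total_capacity <= 0:
        raise ValueError("pools must have positive total capacity")
    # Each pool receives a share of traffic equal to its share of capacity.
    return {name: total_requests * cap / total_capacity
            for name, cap in capacity.items()}


# Hypothetical fleet: equal-sized pools, with the newer servers assumed
# to deliver 1.4x the throughput of the older ones.
example_pools = {
    "harpertown": (100, 1.0),
    "nehalem": (100, 1.4),
}
shares = split_traffic(2400, example_pools)
for pool_name, requests in shares.items():
    print(f"{pool_name}: {requests:.0f} requests/sec")
```

The point of the sketch is only that a per-server throughput measurement, once you have one, can feed directly into how load is weighted across mixed hardware generations.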

Real World Bench Marking v10

16 Responses to “How Facebook Squeezes More From Its Machines”

  1. I’m not sure why this paper was released – it’s much more of an internal thing – Facebook did an investigation to profile their architecture, and they found some stuff that they probably already predicted; however, they did do it at a very fine granularity, which is interesting, and are proposing that the typical benchmarks may not be great indicators; alas, their paper only seems to mention benchmarks casually, without delving into the benchmark vs. actualized disparity – a little more analysis was necessary. Their final point is a good one – profiling is a good strategy.

  2. Good article! Did FB replace all old servers with new ones? That is major investment! Using many cheap servers is the Google model, but there must come a time when running a few supercomputers in place of thousands of desktop servers is more efficient. Just imagine the support and maintenance for so many servers!

    • There is always a pricing sweet spot for compute capacity, cost ( net of both opex and capex ) at any given point in time. It makes sense to buy servers with this sweet spot configuration as opposed to cheapest servers ( mostly underpowered ) or the latest greatest stuff. And yes it sometimes just makes sense to recycle all your existing machines at the end of 3 years.

  3. Obviously FB was doing scalability testing before this tool was created or it wouldn’t know that it typically adds 10,000 new servers every 18 months. FB has plenty of real-world workload performance metrics to draw from, and I’m sure it can afford a server or two for its test environment as part of all those server purchase orders. The company should be able to figure out how changes to the site affect workload performance. So, I’m just not sure I understand how FB is using this tool or why they needed to create it in the first place. What does the company do differently now versus before it had the tool? And what was deficient about existing monitoring and capacity planning tools? Certainly FB didn’t need to create a tool to realize that servers with Intel’s new architecture would perform better than existing servers with an older Intel architecture…or that lower-latency RAM would improve the performance of a RAM-intensive workload…

  4. “This performance boost initiated an investigation that found the web application to be memory- and CPU-bound.”

    Wow. They needed an investigation to figure this out? Some people have an intuitive sense of this stuff, and some don’t. Often it’s correlated with age, so I recommend Facebook hire some olds.

    Every time I hear/read the FB guys talking about how they manage their data center, I think about how often I sit waiting for someone’s FB photo to load. (And I seem to remember a GigaOm article about how fancy their photo servers were.)

  5. “Facebook has more than 30,000 servers, according to some estimates, to which it adds roughly 10,000 new ones every month.”

    Are the above numbers correct? Are we saying that FB came this far on 30K servers, and now it is adding 10K servers every month? Put another way, does that mean, FB is adding one-third as many users (or user data processing) as it has right now to its system every month?