Summary:

Facebook had a problem ensuring that its tens of thousands of caching servers were performing at their peak. If they bogged down, so did the site. So Facebook created Claspin, a heat map of the health of its caches that builds on its institutional knowledge.

Claspin in action. Green is good.
photo: Facebook

Facebook’s engineering team has created a tool to help determine how well cache stores are performing at the social networking giant. The tool, dubbed Claspin, is detailed in a Facebook engineering blog post published on Wednesday.

The solution is an elegant example of translating institutional knowledge about how Facebook’s cache systems work into an easy-to-understand data visualization that allows others to quickly spot a problem and fix it. Sean Lynch, a Facebook engineer, described how, with tens of thousands of memcache servers and a caching graph database called TAO, it was sometimes hard to quickly figure out when a faulty cache was bogging down the system.

Caching is one of the most essential features of the Facebook infrastructure because it keeps the most frequently requested data closer to the web servers, delivering content to users as fast as possible. This matters all the more when you realize that Facebook pulls up content such as Timeline photos continuously as users scroll.

So caching is critical to keeping the user experience fast. But Lynch writes that the social network had no easy way to troubleshoot the cache. From the post:

Facebook has two major cache systems: Memcache, which is a simple lookaside cache with most of its smarts in the client, and TAO, a caching graph database that does its own queries to MySQL. Between these two systems, we have literally thousands of charts, some of which are collected into dashboards showing various latency, request rate, and error rate statistics collected by clients and servers. Most of these graphs and dashboards are backed by Facebook’s “Operations Data Store,” or ODS. This worked well at first, but as Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong. So I started thinking about encoding the “tribal knowledge” we used for troubleshooting in a way that would make it easy for people to assess the status of cache at a glance.

Lynch developed algorithms to rank which metrics are the most important indicators of cache health, based on his team’s knowledge, and then decided to show the results of that number crunching using a heat map that could capture 10,000 servers on a single screen. The color changes were backed by 30 or more stats being calculated in the background in real time. He dubbed the tool Claspin, and its use is spreading inside Facebook, arguably the sign of success in the engineering-dominated company.
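
To make the idea concrete, here is a minimal sketch of how ranked metrics might be reduced to one color per server. It is an illustration only, not Facebook's implementation: the metric names, thresholds, priority order, and function names (`host_status`, `heat_map`) are all assumptions invented for this example.

```python
# Hypothetical metrics, listed from most to least important.
# Each entry is (metric name, warning threshold, critical threshold).
# None of these names or numbers come from Facebook's post.
THRESHOLDS = [
    ("timeouts_per_sec", 1.0, 10.0),
    ("error_rate", 0.01, 0.05),
    ("latency_ms", 5.0, 20.0),
]

SEVERITY_COLORS = {0: "green", 1: "orange", 2: "red"}


def severity(value, warn, crit):
    """Return 0 (healthy), 1 (warning), or 2 (critical) for one metric."""
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def host_status(stats):
    """Return (color, culprit) for one server.

    The color reflects the worst severity across all metrics. When several
    metrics tie for worst, the one listed first in THRESHOLDS is reported,
    which is how the priority ranking shows up in this sketch.
    """
    worst, culprit = 0, None
    for name, warn, crit in THRESHOLDS:
        level = severity(stats.get(name, 0.0), warn, crit)
        if level > worst:
            worst, culprit = level, name
    return SEVERITY_COLORS[worst], culprit


def heat_map(hosts, width):
    """Arrange one color per server into rows of `width` cells."""
    colors = [host_status(stats)[0] for stats in hosts]
    return [colors[i:i + width] for i in range(0, len(colors), width)]


if __name__ == "__main__":
    fleet = [
        {"latency_ms": 3.0},
        {"latency_ms": 25.0, "error_rate": 0.02},
        {"error_rate": 0.02},
    ]
    for row in heat_map(fleet, width=2):
        print(" ".join(row))
```

Collapsing every server to a single cell is what would let thousands of machines fit on one screen; the hard part, which this sketch skips entirely, is choosing metrics and thresholds that match what operators actually know about healthy behavior.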

Facebook’s dedication to better tooling reminds me of Dell’s fanatical efforts to improve production at its factories, measuring everything from table height to where pieces were laid out for assembly. Facebook’s attention here is similar in that these types of tools help it ensure that the network and the ad revenue generated by the network are optimized at all times. Facebook is developing its own Six Sigma or Kaizen efforts for the web.

  1. The key here is the visualisation of the monitoring data. In a sea of green it’s easy to pick out the lighter greens, oranges and reds, which makes it very easy to represent huge numbers of devices in a small area. The difficult bit is determining what “health” means, as this requires a deep knowledge not only of the systems themselves but also of how they should be acting and performing.
