Facebook’s Timeline feature is beautiful, although some revile it. But love it or hate it, Timeline is the engineering equivalent of building a racing bike customized for a specific track, only without testing either until race day. At least that’s how Ryan Mack, an infrastructure engineer with Facebook, makes it seem on a blog posted Thursday detailing how Facebook engineered its Timeline feature.
Making a legacy MySQL database faster.
The blog post also offers a ton of data on how Facebook dealt with petabytes of user data stored in legacy MySQL systems on slow disks that could make Timeline less responsive. The company implemented a new architecture that separates out older data to slower disk storage and keeps more recent and more accessed data stored in flash drives and cached using memcached. From the blog:
Before we began Timeline, our existing data was highly normalized, which required many round trips to the databases. Because of this, we relied on caching to keep everything fast. When data wasn’t found in cache, it was unlikely to be clustered together on disk, which led to lots of potentially slow, random disk IO. …A massive denormalization process was required to ensure all the data necessary for ranking was available in a small number of IO-efficient database requests.
Mack spent a lot of time detailing the challenges of denormalizing the data, which entailed getting an intern to define a custom language that would tell a compiler how to convert old data into the new format and the use of three “data archeologists” to write the conversion rules. It also required Facebook to move older activity data to slower network storage while maintaining acceptable performance. To do this Facebook, “hacked a read-only build of MySQL and deployed hundreds of servers to exert maximum IO pressure and copy this data out in weeks instead of months.”
To speed IO even further, engineers consolidated join tables into a tier of flash-only databases. Since PHP can perform database queries on only one server at a time, Mack said Facebook wrote a parallelizing query proxy that allowed it to query the entire join tier in parallel. Finally Facebook attempted to future-proof its data model, but we’ll see how far that takes it.
In response to a question I sent Mack, said through a spokesman, that it took Facebook two weeks to dump everyone’s old profile data from the existing system into an archive using “some off-the-shelf network attached storage.” That data was then stored on “the new servers that now power Timeline, which, at the time, were fresh and didn’t have anything hitting them from production. We did incremental updates weekly until later, when we had writes going to both systems in real time.” That’s a lot of fresh servers, although Facebook remained mum about the exact data transferred.
Facebook also built the Timeline aggregator to run locally on each database to avoid information traveling over the network unless that information will be displayed on the page. That’s another timesaver. It also used it’s Hip Hop code for speeding up PHP in that aggregator.
Building in parallel.
The other element of the blog that I found amazing was how Facebook apparently denormalized its data, had a product team visualizing the Timeline UI, and built a scalable back end system for everything to run on all at the same time. I’m going to call that crazy, but Mack notes that’s what kept the development time frame to a mere six months.
I asked Facebook about what it learned during that process, because I imagine a lot of other sites would love to be able to develop their front and back end systems in parallel and save the cost of replicating their entire production environment’s data to test it. The response a spokesman emailed me from Mack was as follows:
Layering is a good way of describing the process of building Timeline. To make sure we could work in parallel, we always made sure that there was at least a temporary technology that each part of the team could build on top of. Some of these were pre-existing, but many of them were quick prototypes built in a couple of days as needed. We also had frequent scrums to identify places where people were blocked on each other, and everyone on the team did an excellent job of knowing what others needed or would soon need.
When the production solutions were ready, there was always some friction in migrating from the temporary solution, but usually it was just a few days of work. It was a definite win compared with the weeks or months of delay we’d have faced if we had waited for the production solution up front. Given that we wanted to build the entire system in six months, it was critical that we avoided wasting any time.
From nothing to Timeline in six months is pretty impressive, considering we’re talking about creating something to support more than 800 million users. Now you know a bit more how this Timeline sausage was made.