When you’re running a large web infrastructure, automation is critical to ensure that administrators aren’t spending their every waking second dealing with downed servers. Google, Yahoo and other pioneers had to figure out how to automate operations in their data centers, and now it’s Facebook’s turn. On the Facebook Engineering blog today, the aptly named Pat Power describes how his team of two keeps about half of Facebook’s infrastructure up and running by fixing server problems automatically.
Named the Facebook Auto-Remediation system, or FBAR, Power’s creation is a system of scripts, APIs and plugins that work together to get failed servers back online. At a high level, FBAR works by constantly scanning Facebook’s monitoring system for new outages, then undertaking a workflow to fix each problem it finds. Because FBAR has access to hardware and configuration data, as well as the ability to execute commands on host servers, it’s able to solve some issues by itself. Others, such as a failed hard drive, are marked for human resolution.
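To make that workflow concrete, here is a minimal sketch of the general pattern in Python. It is not FBAR's actual code: the `Alert` type, the `AUTO_FIXES` table, and the `restart_service` and `remediate` functions are all hypothetical names invented for illustration, and the "fix" is a stand-in string rather than a real command run on a host.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    """One outage reported by a (hypothetical) monitoring system."""
    host: str
    problem: str  # e.g. "service_down" or "disk_failed"


def restart_service(host: str) -> str:
    # Stand-in for executing a real command on the host.
    return f"restarted service on {host}"


# Hypothetical table of problems that can be fixed without a person.
AUTO_FIXES: Dict[str, Callable[[str], str]] = {
    "service_down": restart_service,
}


def remediate(alerts: List[Alert]) -> Dict[str, List[str]]:
    """Apply an automated fix where one exists; otherwise flag for a human."""
    outcome: Dict[str, List[str]] = {"fixed": [], "needs_human": []}
    for alert in alerts:
        fix = AUTO_FIXES.get(alert.problem)
        if fix is None:
            # No known automated fix (a failed hard drive, say).
            outcome["needs_human"].append(f"{alert.host}: {alert.problem}")
        else:
            outcome["fixed"].append(fix(alert.host))
    return outcome


if __name__ == "__main__":
    alerts = [Alert("web001", "service_down"), Alert("db014", "disk_failed")]
    print(remediate(alerts))
```

In this toy version the downed service is handled automatically while the failed disk lands in the queue for human attention, which mirrors the split described above. A real system would poll the monitoring service continuously rather than process a fixed list.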
The idea of self-healing systems is nothing new, of course — Google and other web properties do it to some degree, and IBM has been pushing Autonomic Computing for years — but it’s interesting to see new approaches to the problem. Additionally, it’s fascinating to see how a well-designed system can eliminate the need for huge IT departments. As Power notes:
“Today, the FBAR service is developed and maintained by two full time engineers, but according to the most recent metrics, it’s doing the work of approximately 200 full time system administrators. FBAR now manages more than 50% of the Facebook infrastructure and we’ve found that services have dramatic increases in reliability when they go under FBAR control.”