Remember that leap-second incident that threw off much of the internet last summer? Facebook does. Its servers suddenly hit 100 percent CPU utilization, and as a result, a breaker failed at a site Facebook leases in Virginia, bringing down a few rows of gear, something like 300 racks.
The event hardly threw off Facebook’s entire footprint. But it did get engineers thinking more about writing software that integrates third-party building management software with home-cooked tools for monitoring server performance, Tom Furlong, Facebook’s vice president of site operations, said in an interview during the Datacenter Dynamics Converged event in San Francisco on Friday.
The combined system can take into account outdoor information such as temperature and humidity, power consumption for an entire building, and also data on CPU, storage and memory.
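Facebook has not published how the combined system works, so the following is only a rough sketch of the idea, not the company's implementation. It assumes building-side readings and server-side metrics can be matched up by row of racks, and it flags rows where CPU load is high while power draw nears a breaker limit, the kind of situation behind the Virginia outage. Every name, field and threshold here (`FacilityReading`, `ServerSample`, `rows_at_risk`, the 90 percent cutoffs) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FacilityReading:
    """Building-side data, as a building management system might report it (hypothetical)."""
    outdoor_temp_c: float
    outdoor_humidity_pct: float
    row_power_kw: Dict[str, float]  # power draw per row of racks


@dataclass
class ServerSample:
    """Server-side data, as an in-house monitoring tool might report it (hypothetical)."""
    row: str
    cpu_pct: float
    memory_pct: float
    storage_pct: float


def rows_at_risk(
    facility: FacilityReading,
    samples: List[ServerSample],
    breaker_limit_kw: float,
    cpu_threshold: float = 90.0,   # assumed cutoff, not a Facebook figure
    power_fraction: float = 0.9,   # assumed cutoff, not a Facebook figure
) -> List[str]:
    """Return rows whose mean CPU is high while power draw nears the breaker limit."""
    # Group CPU readings by row so they can be compared with per-row power.
    cpu_by_row: Dict[str, List[float]] = {}
    for sample in samples:
        cpu_by_row.setdefault(sample.row, []).append(sample.cpu_pct)

    at_risk = []
    for row, values in sorted(cpu_by_row.items()):
        mean_cpu = sum(values) / len(values)
        power_kw = facility.row_power_kw.get(row, 0.0)
        if mean_cpu >= cpu_threshold and power_kw >= power_fraction * breaker_limit_kw:
            at_risk.append(row)
    return at_risk
```

The point of the sketch is that neither data source alone tells the whole story: a row running hot on CPU is only a facility problem if its power draw is also close to what the breaker can carry.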
Over the past few months Facebook has been rolling out the new data center infrastructure management (DCIM) program and a new cluster-planning system for visualizing all the data. The plan is to roll out the program more widely this year.
The tack Facebook is taking helps in a couple of ways. The software can reduce the amount of time engineers spend figuring out how to rearrange equipment to improve performance. How big of a difference can it make? “Thirty minutes instead of 12 hours worth of drawings and other things,” Furlong said.
It also can contribute to the noble cause of getting Facebook to squeeze the most efficiency out of its existing data centers — and, by extension, precluding the need for yet another data center.
Furlong expected the company to talk more about the system at the next Open Compute Summit in January. He wasn’t sure if the company would make the tool available for public consumption in the way it has disclosed hardware designs in the Open Compute Project. The hitch is that the combined program incorporates some existing internal Facebook monitoring tools, which the company might not want to expose.
But regardless of whether that happens, public discussion of the initiative — Furlong talked it up in general terms before data center aficionados during a session at the Friday conference — can give people an idea of the next logical step for improving efficiency with existing hardware and being smarter about how and when to bring in new equipment that best fits workloads.