For most companies, a one-in-a-billion probability is one they don’t really have to take into account. But Google, which serves about 7 percent of the world’s overall web traffic, isn’t any ordinary company. At the International Conference on Knowledge Discovery and Data Mining on Monday, Google Research Director Peter Norvig shared some of the scenarios that Google takes into account when designing its infrastructure and systems, despite the minute probability of their actually happening.
It wasn’t always this way, but Norvig said growing to its current scale has forced Google to undergo an attitude adjustment with regard to what actually matters. Making sure that its systems stay up and running now means a lot more than simply being prepared for run-of-the-mill power failures or software bugs.
Among the ludicrous-sounding but entirely real concerns that Norvig highlighted were:
- Thieves: People don’t only like to steal data; sometimes they steal the machines it’s stored on.
- Drunken hunters: You know those large red balls placed on power lines so low-flying aircraft know they’re there? It turns out some hunters like to shoot at them, which isn’t good for keeping the power on.
- Sharks: They’ve been known to bite the fiber-optic cables that traverse oceans.
- Authoritarian governments: Pakistan once wanted to cut off access to what it considered blasphemous YouTube videos, but ended up cutting off all global access to YouTube for a couple of hours.
- Cosmic rays: Hey, it could happen, and the effects on computer equipment, and electronic equipment generally, wouldn’t likely be good.
In order to keep Google’s service running in the event something does happen, though, Norvig explained that the company has many strategies in place to recover from or help avoid system failures. These include replication, checkpointing, sharding, monitoring and, when possible, relying on loose consistency, approximate answers and/or incomplete answers.
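To make two of those strategies concrete, here is a minimal Python sketch of sharding combined with replication. It is an illustration only, not a description of Google’s systems: the `ReplicatedStore` class, the `shard_for` helper, and the shard and replica counts are all hypothetical. Each key is hashed to one shard, every write goes to all healthy replicas of that shard, and a read is served by the first healthy replica, so losing a single machine does not lose the data.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration
REPLICAS = 3    # hypothetical number of copies kept per shard


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash (sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ReplicatedStore:
    """Toy in-memory store: each shard is kept as several replica dicts."""

    def __init__(self, num_shards: int = NUM_SHARDS, replicas: int = REPLICAS):
        self.num_shards = num_shards
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]
        self.down = set()  # (shard, replica) pairs currently marked as failed

    def put(self, key: str, value) -> None:
        """Write the value to every healthy replica of the key's shard."""
        s = shard_for(key, self.num_shards)
        for r, replica in enumerate(self.shards[s]):
            if (s, r) not in self.down:
                replica[key] = value

    def get(self, key: str):
        """Read from the first healthy replica that holds the key."""
        s = shard_for(key, self.num_shards)
        for r, replica in enumerate(self.shards[s]):
            if (s, r) not in self.down and key in replica:
                return replica[key]
        return None  # an incomplete answer instead of an error


store = ReplicatedStore()
store.put("video:123", "metadata")
store.down.add((shard_for("video:123"), 0))  # simulate one failed machine
print(store.get("video:123"))                # still served by another replica
```

Returning `None` when every replica is unavailable is a deliberate choice in this sketch: it mirrors the idea of tolerating an incomplete answer rather than failing the whole request.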
Google also has to worry about possible software or algorithmic glitches that might not even occur to other companies, Norvig noted. For example, even if something works for thousands of tests in a row, there’s no guarantee it won’t mess up on the billionth time. The problem for Google is that it handles so many transactions that it’s bound to find out. Norvig said that evaluating processes or data accuracy at Internet scale can get expensive, so companies have to know how to balance their concerns over accuracy and cost.
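The arithmetic behind “bound to find out” is easy to check. If each transaction fails independently with probability p, the chance of seeing at least one failure in n transactions is 1 - (1 - p)^n. The short Python sketch below works through that formula; the function name and the transaction counts are hypothetical and are not figures from Norvig’s talk.

```python
import math


def prob_at_least_one_failure(p: float, n: float) -> float:
    """Chance of one or more failures in n independent trials, each failing
    with probability p. Computes 1 - (1 - p)**n via log1p/expm1 so that
    very small p does not get lost to floating-point rounding."""
    return -math.expm1(n * math.log1p(-p))


p = 1e-9  # a one-in-a-billion event
for n in (1e3, 1e6, 1e9, 1e10):  # hypothetical transaction counts
    print(f"{n:>14,.0f} transactions: {prob_at_least_one_failure(p, n):.6f}")
```

Under these assumptions, a thousand transactions will almost never hit the bug, while a billion transactions hit it at least once about 63 percent of the time, and ten billion make it a near certainty.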
A few other interesting nuggets that Norvig shared: Google’s average Power Usage Effectiveness (PUE) is 1.16, meaning its data centers draw about 0.16 watts of overhead for every watt delivered to computing equipment; its software systems are almost constantly being tweaked; its researchers are embedded in engineering teams working on production services; and Google studies show that more data trumps better algorithms when it comes to getting the best results.
However, despite the massive scale and seeming complexity of Google’s internal systems, Norvig said there is something of a focus on simplicity. He explained the “five-minute/one-hour rule,” which dictates that if a researcher can’t explain what a new system does within five minutes, and if he can’t demonstrate it to the point where someone else could use it in an hour, it probably needs some refinement.