
Google Infrastructure Czar: Cloud Gets It Done

It was nearly five years ago when I last spent time with Urs Hölzle, Google’s infrastructure czar. (His official title is SVP of operations.) It was around that time he introduced me (and several others) to many of the concepts (such as cloud and big data) that are now part of the technology sector’s vernacular. Hölzle was the company’s first VP of engineering, and he has led the development of Google’s technical infrastructure.

Hölzle’s current responsibilities include the design and operation of the servers, networks and data centers that power Google. It would be an understatement to say that he is among the people who have shaped modern web infrastructure and cloud-related standards. When I had a chance to chat with him recently, my question was, “How do you define the cloud?”

I wanted to know, because frankly, the usage of “cloud” has been hijacked for marketing purposes, thanks to indiscriminate labeling of anything and everything on the Internet. Urs has a pretty clear and concise idea of what the cloud means to him (and Google) and what it’s good for.

Tiny Machines

For Hölzle, cloud-based computing comes into play when you have “very big computers that are basically buildings” or very small computers (such as smartphones) — and nothing in between. The small devices can get all their functionality from the big computer, which he has defined as the warehouse-scale computer.

“Here’s my Nexus S, and it can actually run many, [but] not quite all, Google apps, even though it has a tiny processor,” says Hölzle. “There may not always be photos on it, but all your photos are reachable from here. And maybe not all of your email is on it, but all your email is reachable from here.” It doesn’t matter whether the data is in Iowa, Oregon or halfway across the world from where you are.

As a person, you don’t have to worry about backups and viruses, Hölzle says. You don’t have to worry “about changing machines, reconfiguring your machine or installing software.” On the cloud, there is no concept of scheduled downtime, because the cloud is supposed to work all the time. In other words, cloud-based computing lets “you just actually do the work that your company was founded for, instead of focusing on the technology behind the tools that you’re using.”

The Cloud Tone vs. Dial Tone

One of the big challenges of cloud-based computing is reliability and availability. Remember when millions of us were impacted by outages at Google’s Gmail and Skype’s voice service? There was a serious dip in productivity when those apps took a nosedive.

Perhaps we should be asking for five nines of reliability from our cloud services. Five nines (99.999 percent) means less than 5.26 minutes of downtime every year — like the dial tone on a landline phone. After all, that’s the only way to trust and rely on the cloud for all the important things we do and services we use.
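The arithmetic behind the nines is easy to sketch. Here is a minimal Python snippet (the function name is mine) mapping a number of nines of availability to the downtime it allows per year:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960 minutes


def downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of the given number of nines."""
    unavailability = 10 ** (-nines)  # e.g. five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


# Five nines (99.999%) allows roughly 5.26 minutes of downtime a year;
# four nines about 52.6 minutes; three nines about 8.8 hours.
for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes(n):7.2f} minutes/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from four to five nines is so expensive.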

Hölzle believes that in the cloud-centric world, there needs to be a fine balance between convenience and reliability. He compares phone systems: Mobile phones aren’t as dependable as landlines, but they are more useful because they can be used in different locations for different types of needs. “Mobile phones are sort of overtaking landlines because they add additional functionality that is worth the annoyance,” he says. Instead of five nines, Hölzle says that cloud apps should aim to be always available.

The only way to get to zero outages is to avoid making changes to your cloud app. “In our apps, we’re not actually shooting for five nines because that would lower the feature velocity,” says Hölzle. Put another way: When you make changes, problems happen, so getting to zero outages essentially means making no changes. The landline phone system changed very little, and as a result it became a paragon of five nines.

“Whenever there is an update with new features, it introduces more risk, and that will cause more downtime, invariably, because humans are not perfect, and occasionally something is going to go wrong at a small scale, and hopefully, very, very rarely something may go wrong at a larger scale. But it does happen,” he adds.

Gmail’s most recent outage notwithstanding, it takes only about 30-odd Google employees to keep Gmail running, Urs says. However, there’s a much larger infrastructure team behind Gmail and other Google services. “If we had just email, then we would have to build all of that on top of building the actual email product,” he argues.

Others might disagree, but Hölzle believes Google’s common infrastructure gives it a technological and financial edge over on-premises solutions. “We’re able to avoid some of that fragmentation and build on a common infrastructure,” says Hölzle. “That’s actually one of the big advantages of the cloud.”


2 Responses to “Google Infrastructure Czar: Cloud Gets It Done”

  1. Rhagu,

    Your example of Amazon is very illustrative.

    But the issue with the cloud is more complex; let me explain. Amazon has been optimizing its logistics operations, like FedEx and others, and once you reach that level of maturity you make only minor changes over time and a big change every three or five years; that’s how they reach six sigma. But as Hölzle states, there is a continuous process from Google’s labs to test new products, and it happens on a daily basis. These companies have been investing billions of dollars in improving their server farms (for example, Facebook’s open project to increase server performance), but even though these companies have the money and the best engineers on earth, there are moments when they suffer problems; just look at what is happening with WordPress.
    Companies are always trying to reach 100 percent, but we are human, and we are not perfect, so even a great hacker can make a mistake and cause a failure like the one in the Mars Pathfinder landing.

    A final thought: I love the description of the connection between the server farms and smartphones, and how it simplifies things for the user.

    Greets to all

  2. Om,

    Nice article on your session with Urs… I have indeed heard some great remarks about him!

    However, I was a little surprised by his statement – “In our apps, we’re not actually shooting for five nines because that would lower the feature velocity”.

    I understand there is a fine balance between reliability (cost) and service delivery (convenience or velocity) in everything we do in life. But that does not mean we trade off reliability to a level lower than what world-class operational excellence should be. On the contrary, for a cloud-based concept of “always available,” why wouldn’t the goal be six sigma, similar to what world-class physical operations companies strive for?

    Let me explain what I mean with an example: Amazon is THE world-class leader in e-commerce fulfillment and ships over 20 million SKUs to 100M+ customers globally. They do well over 500 million shipments a year to their customers. Even if they deliver at six sigma for accuracy of their shipments to the right customers, think about how many customers they deliver the wrong stuff to: it would be 1,700 customers receiving the wrong shipment every year, with free return costs, product replacement costs and BIG customer dissatisfaction! If they settled at only four nines, similar to what Urs is saying, they would be shipping nearly 241,000 shipments to the wrong customers every year!!!

    Do you think they settle for four nines in the context of introducing new products, features or global customers creating shipping complexity? Obviously they do not; even for a physical product delivery process like shipping, they push for constant defect reduction and a march toward six sigma on a daily basis for customer shipment reliability!

    So why would Google, as THE leader in online search, cloud apps and the online infrastructure space, not aim for a six-sigma approach and an “Always On Online Availability” model? I’m not saying they should compromise feature velocity for it; they should launch frequently and make changes, but saying we compromise reliability below a certain level also does not make sense, especially in a model like cloud, where the fundamental perception is no server downtime, right? As the leader in this space, they should set the world-class gold standard.

    I’m curious to hear why Urs isn’t similarly pushing for it as the goal, and what he thinks about it. After all, isn’t he the operational excellence leader @Google with his new title of SVP of operations?
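The comment’s comparison can be checked with a few lines of Python. Using the conventional six-sigma figure of 3.4 defects per million opportunities (DPMO) and exactly 500 million shipments, six sigma does come out to the roughly 1,700 wrong shipments the comment cites; the four-nines count naturally depends on the assumed volume and defect definition (the function and constant names below are mine):

```python
def expected_defects(volume: float, defect_rate: float) -> float:
    """Expected number of defective outcomes for a given volume and per-unit defect rate."""
    return volume * defect_rate


SHIPMENTS = 500_000_000  # the comment's "well over 500 million shipments a year"

# Six sigma is conventionally quoted as 3.4 defects per million opportunities (DPMO).
six_sigma = expected_defects(SHIPMENTS, 3.4 / 1_000_000)

# Four nines of accuracy (99.99%) means a defect rate of 0.01%.
four_nines = expected_defects(SHIPMENTS, 1 - 0.9999)

print(f"six sigma:  {six_sigma:>9,.0f} wrong shipments/year")
print(f"four nines: {four_nines:>9,.0f} wrong shipments/year")
```

The gap between the two rates, roughly a factor of thirty at this volume, is what drives the comment’s argument that “four nines” and “six sigma” are very different quality targets.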