
Why you should expect more online outages but less downtime

Google’s webmail service Gmail (s goog) was down for 18 minutes last week after a “routine update” broke the service. The search giant reported that it conducted a routine update of its load balancing software between 8:45 AM PT and 9:13 AM PT and, after the problems were detected, quickly rolled back the buggy code. But this didn’t stop some people from questioning why Google would roll out a software update at what are peak email-checking hours on the West Coast.

The answer is that most of the coders behind today’s popular web sites and services are deploying their code when it’s ready, not at some pre-determined point when downtime may not be noticed. It’s called continuous code deployment, or some variation on that theme, and everyone from Facebook (s fb) and Netflix (s nflx) to smaller services does it. So while it may occasionally cause a few blips, those blips should be shorter and less catastrophic.

There’s no good time for downtime in an always-on world

James Urquhart (Cisco), Luke Kanies (Puppet Labs), Jesse Robbins (Opscode) - Structure 2011

The rationale for doing these sorts of continuous deployments varies, but most reasons fall into four categories. The first is that there really is no good time for downtime anymore, but if you break it, wouldn’t you rather have happy and awake staff on the clock ready to fix it? Jesse Robbins, the chief community officer of Opscode, points out that even good times for downtime can vary across customers.

“One of Opscode’s earliest customers is a popular dating website, and their peak traffic is on Friday night when people are exchanging phone numbers to go on dates… the exact opposite of peak time for a CRM,” says Robbins.

Plus, as Robert Treat, COO of OmniTI, a consulting firm that helps web sites scale out their business, points out, sometimes deploying at off hours means little because the site won’t actually break until it experiences peak loads. For many of the sites using continuous code deployment, a growing user base is what caused the need for new code in the first place. Until the site experiences that load, they don’t know whether the fixes worked.

Just in time code-deployment

The second category is economic. When you wait to deploy your code in these massive quarterly installs, you’re deciding to forgo the efficiencies that the new code could bring to the site today. This thinking is more common among companies that view their web operations as fundamental to doing business, as opposed to some sort of cost center that keeps email up and running.

“Code that has been written but not yet deployed is very similar to inventory,” says Mark Imbriaco of GitHub. “You’ve paid the cost to develop the software, but are not yet getting any of the benefit from it. Shipping that code to production sooner means that you and your customers can benefit from it much faster. This is a pretty serious competitive advantage for companies that deliver features faster than their competitors.”

Routine code deployments make for happy developers

Thinking of code deployment as a Big Fat Hairy Deal adds layers of stress and process to getting it into production, but if it’s a routine part of the job, developers can try things out, deploy code and move on with their lives. This reduces stress around the deployment, and it also frees their minds up for new problems and jobs, notes John Allspaw of Etsy. Plus, he says, “Fast and frequent feedback is what allows for developers to be productive. Developers hate being bored.”

Punishing your site makes it stronger

(L to R) Alexei Rodriguez – VP of Operations, Evernote Corporation; Adrian Cockcroft – Director, Architecture, Netflix; Aditya Agarwal – VP Engineering, Dropbox
(c)2012 Pinar Ozger

The fourth school of thought is popularized by Netflix and is basically an invitation to break things, because a system so fragile that one code upgrade brings it down clearly isn’t resilient enough. In many ways Netflix takes the idea of building out an architecture that’s dependent on a genius IT professional and his version of delicate pieces and crazy glue and flips it on its head. Instead of a fragile model car, Netflix is building the Tonka trucks of IT, ready to take a few glitches and keep on serving up videos.

“Systems that contain and absorb many small failures without breaking and get more resilient over time are ‘antifragile’ as described in [Nassim] Taleb’s latest book,” explains Adrian Cockcroft of Netflix. “We run chaos monkeys and actively try to break our systems regularly so we find the weak spots. Most of the time our end users don’t notice the breakage we induce, and as a result we tend to survive large-scale outages better than more fragile services.”
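To make the idea concrete, here is a toy simulation of that kind of induced breakage. It is only a sketch under stated assumptions, not Netflix's actual tooling: the `Instance`, `Cluster`, `chaos_round` and `run_experiment` names are hypothetical, and a real system would be killing real servers rather than flipping a flag.

```python
import random


class Instance:
    """A hypothetical service instance that is either up or down."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Cluster:
    """Routes each request to the first healthy instance it can find."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # absorb the small failure and try the next instance
        raise RuntimeError("total outage: no healthy instances left")


def chaos_round(cluster, rng):
    """Deliberately take down one random healthy instance and return it."""
    healthy = [i for i in cluster.instances if i.healthy]
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def run_experiment(size=3, kills=2, seed=0):
    """Kill `kills` instances one by one; count the kills users never noticed."""
    rng = random.Random(seed)
    cluster = Cluster(Instance(f"node-{n}") for n in range(size))
    survived = 0
    for _ in range(kills):
        chaos_round(cluster, rng)
        cluster.handle("GET /video")  # raises if end users would see an outage
        survived += 1
    return survived
```

With three instances and two induced failures, `run_experiment(size=3, kills=2)` completes because one instance is always left to serve traffic. With only two instances, the second kill surfaces as a `RuntimeError`, which is the kind of weak spot such an exercise is meant to expose before a real outage does.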

That’s the rationale behind those software updates that might cause a momentary web service outage or two. As the devops movement spreads, more businesses will likely find reasons to move toward continuous code deployment. Plus, as Allspaw of Etsy points out, the tools to test code and instantly monitor the effects of new deployments are getting better and faster. That means if you accidentally break a site, the dev team notices it faster and fixes it. So maybe there are more outages, but they shouldn’t last as long.
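The deploy, monitor and fix loop described here can be sketched in a few lines. This is a minimal illustration, not any company's real pipeline: the `deploy_with_rollback` function, the 5 percent error threshold and the sampled error rates are all assumptions made for the example.

```python
def deploy_with_rollback(current, candidate, error_rates, threshold=0.05):
    """Put `candidate` live, watch the error rate, and roll back on a spike.

    error_rates: post-deploy error-rate samples (0.0 to 1.0), for example
    one per monitoring interval. Returns (live_version, samples_seen).
    """
    live = candidate
    seen = 0
    for rate in error_rates:
        seen += 1
        if rate > threshold:
            live = current  # roll back to the last known-good version
            break
    return live, seen


# A bad release: errors spike on the third sample, so the old version returns.
print(deploy_with_rollback("v1", "v2", [0.01, 0.02, 0.30, 0.40]))  # ('v1', 3)

# A healthy release: no sample crosses the threshold, so the new code stays.
print(deploy_with_rollback("v1", "v2", [0.01, 0.00, 0.02]))  # ('v2', 3)
```

The shorter the monitoring interval, the less time passes before the spike shows up in a sample, which is why faster monitoring tools translate into shorter outages.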

9 Responses to “Why you should expect more online outages but less downtime”

  1. Our major conflict arises from providing a devops cloud-based solution to traditional Enterprise clients. We struggle to convince their IT departments that the benefits of continuous deployment outweigh the negatives. The thought that we will change our service without allowing them to go through a whole staging test procedure appears as anarchy and high-risk to them. We’ve managed to get to a happy medium by running our new code against live traffic using Parallel Proxy, but it will be a few years yet before the principle of continuous development has a track record of less downtime to make a compelling argument to Enterprise IT.

  2. Ed Parsons

    Hi Stacey. Another reason why we can expect outages to not be as detrimental is that we are starting to see more tools developed to architect around the risk. For example, at Ilesfay (cloud based replication startup), our ZoneSync solution not only allows for the replication of persistent storage between cloud regions and between different clouds but also provides an API to consume the data from the best location. Take a look here:

  3. Dave Williams

    Would you like your ATM network or bank to adopt this model? I think not. It may work well for services that evolve quickly and whose SLAs tolerate small outages, but for those services where downtime penalties are your job (or worse) I think more mature models may be better.

  4. Lily Wilks

    Nice post on why you should expect more online outages but less downtime. BTW, if you are looking for fast and reliable replication and synchronization between cloud services like Evernote and Google Docs, please check out CloudHQ by clicking the link

  5. Aaron Rudger

    Continuous deployment or integration as a practice should not be confused with scalability and capacity planning. Application and infrastructure changes to create capacity can be tested against production load to ensure they don’t impact customers. Outages *do* have a cost *and* are preventable. Seems short-sighted to sacrifice continuity for fast features when low-cost load testing options are available to dev & ops teams.

  6. Kenny Mu Li

    This is assuming that availability issues occur on the software end, which is not necessarily the case, especially with hardware downtime in the cloud era for IaaS. With the hundreds of IaaS vendors out there, we see data center outages all the time, many of which house critical web-facing applications for clients.

    I agree with Ralph regarding the IT departments, but when you off-site IT responsibility to a facility managing thousands of other clients, sometimes things are just out of our control.

  7. Lacy McClain

    It seems the Sprint model is becoming more popular…being first to market is more important than high quality. The fact is we are in an I WANT IT NOW society and we would rather have an imperfect product than no product at all, which means more outages, but in response they are hiring more people to handle it. Anyone in IT should know this is not a good model and that costs increase the further the product is in the lifecycle. It seems companies would rather spend high dollars on support and maintenance than on design and test. I think that can work in certain industries such as telecom or entertainment, but I would not want that to be the model operated under if you were the company in charge of building the car I put my children in or any industry that involves my money or my health.

  8. IT departments are essential to pretty much every business these days, whether they be actively ‘on-line’ businesses or not. While I know it is not an easy thing to replicate, the efficiency and confidence displayed by the companies mentioned in the article should be aspired to by all IT departments. It’s a lot to aspire to, but in trying to make the journey there things would get clearer and more efficient, and hopefully satisfy your one corporate client, which has got to be easier than millions of consumers. It is great to see how some organisations are so confident in their systems and response times to problems that they actively try and break them!