
Summary:

James Urquhart kicks off a discussion about system resiliency by outlining the key concepts — devops, complex adaptive systems and anti-fragility — that affect it in the cloud computing era.

Some time ago, my friend Phil Jaenke and I (and a few others) got into a debate on Twitter. The discussion started as an exploration of the changing nature of software development, operations and change control, and whether they are good or bad for the future of software resiliency. It resulted in a well-articulated post from Phil, arguing that you can’t have resiliency without stability, and vice versa.

However, as I started trying to outline a response, I realized that there was a lot of ground to cover. The core of Phil’s argument comes from his background as a hardware and systems administration expert in traditional IT organizations. And with that in mind, what he articulates in the post is a reasonable way to see the world.

However, cloud computing is changing things greatly for software developers, and these new models don’t take kindly to strict control models. From an application-down perspective, Phil’s views are highly suspect, given the immense success that companies like Etsy and Netflix (despite their recent problems) have had with continuous deployment and engineering for resiliency.

Reconciling the two views of the world means exploring three core concepts required to understand why a new IT model is emerging, but not necessarily replacing everything about the old model.

The first of these concepts is devops, which earned its own three-part series from me a few years ago, and has since spawned its own IT subculture. The short, short version of the devops story is simple: modern applications (especially web-scale or so-called “big data” apps) require developers and operators to work together to create software that will work consistently at scale. Operations specifications have to be integrated into the application specifications, and automation has to be delivered as part of the deployment.

In this model, development and operations co-develop and very often even co-operate, thus the term devops.

The second concept is one that I spoke about often in 2012: complex adaptive systems. I’ve defined that broad concept in earlier posts, but the stability-resiliency tradeoff is a concept that is derived from the study of complex adaptive systems. Understanding that tradeoff is critical to understanding why software development and operations practices are changing.

The third concept is that of anti-fragility, a term introduced by Nassim Nicholas Taleb in his recent book Antifragile: Things That Gain from Disorder. Anti-fragility is the opposite of fragility: as Taleb notes, where a fragile package would be stamped with “do not mishandle,” an anti-fragile package would be stamped “please mishandle.” Anti-fragile things get better with each (non-fatal) failure.

Although there are elements of Taleb’s commentary that I don’t love (the New York Times review linked to above covers the issues pretty well), the core concept of the book is a critical eye-opener for those trying to understand what cloud computing, build automation, configuration management automation and a variety of other technologies are enabling software engineers to do today that was prohibitively expensive even 10 years ago.

So, over the next few weeks, I will try to explore these concepts in greater detail. Along the way, I will endeavor to address Jaenke’s concerns about the ways in which these concepts can be misapplied to some IT activities.

Please join me for this exploration. Use the comments section to push back when you think I am off-base, acknowledge when what I say matches what you have experienced, and, above all, share how you think your organization and career will change one way or another.

As always, I can be found on Twitter as @jamesurquhart.

Feature image courtesy of Shutterstock user Sinisa Botas.


  1. Looking forward to this new series :)

  2. Eamonn Colman Monday, January 14, 2013

    Love the photo you’ve chosen to kick-start the series. I’m guessing one of the biggest challenges might be cost-effectiveness. If I wanted to build an ‘anti-fragile’ house of cards I might need a vacuum sealed chamber in zero-G where I can blow on it without harm.

  3. The reality here is that developers want to be seen as being cool and relevant again, so standards, processes and COTS products are seen as boring and inhibitors. Just look at the Netflix model: for all their supposedly “cool stuff,” their business model sucks and they still don’t work on Amazon properly. One of the big issues is that people believe that cloud brings individualism and the opportunity to be unique. While I agree with that to a point, once cloud matures it will be something that actually delivers and works properly….

    I work for an internet company and for all the devops, code and cool stuff, all I want is stability and interoperability across all products so they work…I long for the day when I can buy that as a canned product…..Today I have to ego massage lots of developers telling me couchbase is cooler than cassandra….Wrong Conversation!

  4. cloudierthanthou Tuesday, January 15, 2013

    Netflix’s Chaos Monkey is a manifestation of the stability-resiliency tradeoff: giving continuous small shocks to the system ensures a greater likelihood that it will survive the big one.

  5. Steve, I am sure _some_ developers do or want to do devops because it is cool. But the main job of a developer is to automate things. And the only truly repeatable processes are automated ones. It means spending less time on overhead and more on bringing true value to the business. Processes, standards and COTS are inhibitors if they cannot be “automated” – not just to developers, but mostly to the business. And yes, they _are_ boring.

    I just estimated a task. Because we don’t have automated processes, we estimated 8 hours, and the majority of it is administrative overhead. The actual task would take a few minutes. Now multiply that by the number of times it needs to be done (which is a bunch).

    As for Netflix, why does their business model suck? If it does, why does RedBox want in?

    And what do you mean it still doesn’t work properly on AWS? It seems to work pretty “properly” to me? Will there be issues? Yup. Then they get it fixed. Guess what? Internally hosted systems have issues too. And they typically end up being bigger because we/us/you/them don’t have the level of quality that someone like Netflix has.

    Q: What is an “internet company”?

  6. Just a quick note; due to external factors, namely DNS going splat, the corrected link for my post is: http://www.rootwyrm.com/2012/12/stability-resilience-not-stability-resilience/
    Thanks all! Looking forward to discussing this a lot more!
    -Phil
