
Summary:

Without thinking about the SLA it’s impossible to compare costs between clouds. An enterprise gigabyte comes with a whole host of services around it, while Amazon’s does not. So while it seems like cost per gigabyte is a good measure, it’s not. Here’s why.


Like an annoying pop song, Service Level Agreements (SLAs) have been on my mind for a while. In most cases, they are not worth the digital bits sent to serve them up on the web. Yet they are important when comparing cloud service providers, because if you don’t take the SLA into account, your business may spend far more time and effort engineering around failures to prop up an inexpensive cloud.

Recently, I received an e-mail comparing a customer’s internal storage costs to Amazon’s. Of course, Amazon seems to be cheaper based on a pure gigabyte comparison. But it was a flawed analysis because it didn’t include the service level promised, never mind guaranteed.

Without that it’s impossible to compare costs. The enterprise gigabyte comes with a whole host of services around it, while Amazon’s does not. IT groups are terrible at explaining their offerings, so it seems like cost per gigabyte is a good measure, but it’s not. Here’s why.

Amazon’s Elastic Block Store (EBS) service has huge variations in performance that make the service range from good enough to completely useless. It’s so variable that it takes a lot of work to architect around it. This post by Orion and Mdash of Heroku delves into some of the work they did to get acceptable performance from EBS, such as using lots of disks, larger buffers, the right file systems, and larger chunk sizes on the RAID. Maybe they toss in a little unicorn powder.

I was struck by how much work they did as storage admins. That costs money in labor, hiring, and training. Add backups plus multi-zone and multi-datacenter redundancy, and the dollar-for-dollar comparison between private storage and public storage begins to be a fairer fight.

This is not a knock on EBS, which I use. But I did learn that when you use a cloud service, you always accept a service level, whether explicit or implied, and that brings a whole bunch of labor. That labor could be avoided if you consider a provider offering a higher service level instead of ruling it out because its gigabyte pricing is too high.

Getting to better SLAs.

To improve SLAs we need to work on two questions. First, what are the operational expectations I should have in terms of reliability, availability, performance, and security? Unlike Heroku, which owns its own code, an enterprise customer can’t rewrite Oracle or SAP or their existing 1000+ internal applications. Therefore, specific operational expectations are very important to determine the suitability of the infrastructure for a specific application. The Amazon service catalog description for EBS performance doesn’t provide any expectation, and Amazon doesn’t commit to any service level that allows the customer to carry out operations they might need to implement in order to recover from failures.

Below, for example, is a potential way to set expectations: service catalog options. Each offering has different levels and costs. The customer can easily see that Tier 1 Windows includes higher level support, while Tier 4 does not include support. The customer can select and make a trade-off between service level and labor that works for them.

The second question is related to the visibility and tooling for recovery that a customer has. I agree with Christian Reilly that the legalese in SLAs is useless. It is biased in favor of the vendor, the penalties (if any) cannot hope to equal the value lost, and it is operationally useless. It’s this last bit that better SLAs can change, and in doing so make SLAs operationally useful.

For example, EBS provides replication, but I can’t see or manipulate the replica, nor is there a service request a user can make to get a copy. Therefore, operationally speaking, the customer is in charge of making replicas, snapshots, and backups for when EBS fails. And cloud is made by humans. Trucks hit generators. Lightning strikes. And that means the cloud will fail, and the customer may have no recourse to prevent their data from being lost.

There are other cloud providers who do provide backups, mirroring, snapshots and the tooling. And those services cost more because the cloud provider adds more labor, know-how and resources to deliver that service level. Otherwise, the customer is on the hook for that labor, which adds to that gigabyte cost comparison.
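To make the arithmetic behind that comparison concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `StorageOffer` fields, the `effective_cost_per_gb` helper, and all of the prices, hours and rates are made up and do not reflect any provider’s actual pricing. It only shows how labor and extra copies can be folded into a per-gigabyte figure before two offers are compared.

```python
from dataclasses import dataclass


@dataclass
class StorageOffer:
    """One storage option; every field here is a hypothetical input."""
    name: str
    price_per_gb_month: float     # list price in dollars per GB per month
    admin_hours_per_month: float  # customer labor needed to reach the target service level
    extra_copies: int = 0         # additional full copies the customer pays for (backups, second zone)


def effective_cost_per_gb(offer: StorageOffer, capacity_gb: float, hourly_rate: float) -> float:
    """Monthly cost per usable GB once extra copies and labor are included."""
    storage = offer.price_per_gb_month * capacity_gb * (1 + offer.extra_copies)
    labor = offer.admin_hours_per_month * hourly_rate
    return (storage + labor) / capacity_gb


# Made-up numbers, chosen only to show the shape of the comparison.
bare = StorageOffer("bare cloud volume", 0.10, admin_hours_per_month=40, extra_copies=1)
managed = StorageOffer("managed tier with backups", 0.30, admin_hours_per_month=4)

for offer in (bare, managed):
    cost = effective_cost_per_gb(offer, capacity_gb=10_000, hourly_rate=75.0)
    print(f"{offer.name}: list ${offer.price_per_gb_month:.2f}/GB, effective ${cost:.2f}/GB")
```

With these invented inputs, the offer with the lower list price ($0.10 per GB) ends up at an effective $0.50 per GB, while the $0.30 offer lands at $0.33. Different assumptions about labor and copies will of course move the result either way, which is exactly why the service level has to be part of the comparison.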

New SLAs mean new types of clouds.

In conclusion, the road to better service level agreements will require the vendor to articulate the service offering and expected performance clearly, in a way that is visible and actionable by the customer, with clear metrics for recovery. So if the service includes backup, the customer understands the scope, can see when a backup failed, and can recover from failure on their own within the promised window (mean time to recover).

New SLAs will likely lead to two types of clouds, or at least make this bifurcation of clouds more visible. One cloud will be for apps designed for failure, scale-out and mobility: perfect for single-app startups coming out of Silicon Valley and some new field applications. That makes sense because startups have more labor than money.

The other is built to bring existing enterprise applications into a more cloud-like operating model, meaning a transition of existing applications and workloads to be more on-demand, elastic and pay-per-consumption, using either a private cloud or an enterprise cloud provider. I call them “city clouds” because they have to work within existing infrastructure, follow rules and customs, and are constrained by other applications. And thanks to better SLAs, customers can know what they are choosing and price that into their estimates.

Rodrigo Flores is a cloud enterprise architect at Cisco Systems and the former founder and CTO of newScale, which was acquired by Cisco in April 2011.


  1. This is something I’ve said/taught clients forever.

    In an SLA, if an ISP guarantees 99% uptime, that still leaves room to be down for 7.2 hours per month. If you had a website down for that many hours each month, you’d leave. So now push it to 99.9%. It seems like a pretty good barometer to “only” be allowed to be down for roughly 45 minutes per month, but pick the wrong 45 minutes in the wrong configuration and you’re out of business. And none of that even accounts for what “down” MEANS.

    Or look at OCR software and do the same math. 99.9% accuracy means 2.5 errors per page. How’s that useful?

    It’s all about client expectations and relationships, folks. Thanks for a great piece!

    Jeff Yablon
    President & CEO
    Answer Guy and Virtual VIP Computer Support, Business Change Coaching and SEO Consulting/Search Engine Optimization Services

  2. The most hardcore “Pragmatist” amongst us will still say, “I don’t care about any of that. I just want it (service or product) to work!”

    1. I would have thought a pragmatist would know that every product or service fails to work at times. SLAs are simply a way for savvy buyers to define how much they are willing to pay for less failure. They are thinking about economic impact, http://www.actual-experience.com/blog/?p=765.

  3. Effective Service Level Agreement (SLA) is absolutely vital and extremely important to any organization to ensure effective engagements for any service or product delivery. A well-defined SLA records the expectations for both sides of the relationship and provides targets for accurately measuring performance against those objectives. Read more here: http://blog.maia-intelligence.com/2008/06/05/importance-of-an-sla-for-any-organization/

  4. Eelco van Beek Monday, August 22, 2011

    The thing with most SLAs is that they’re not specifically meant for just a part of a service chain, because in that case they’re only useful as an indicator for an architect or engineer to decide how redundant that specific part should be in the complete architecture – no guarantee of end-user quality. Cloud services, specifically IaaS, are in the end just tools and components. They still need to be glued together into something useful and trustworthy for the end user. Not to shamelessly promote our company, but we deliver SLAs for the full chain, with or without cloud components. In my opinion those kinds of SLAs still matter most.

  5. Great article and so true… I believe there is one other element that is very important: trust. Do you trust your provider? Do they have a fantastic reputation? There is no point in an ironclad SLA if a “business” charges a fortune and, in the event of failure, doesn’t have the resources and know-how to fix it. It’s all about partnership. You wouldn’t put all your money in an offshore account hoping it will be safe and managed well without some level of understanding of the provider’s capabilities and history, so why do it with your valuable data?
