Like an annoying pop song, Service Level Agreements (SLAs) have been on my mind for a while. In most cases, they are not worth the digital bits sent to serve them up on the web. Yet they are important if we are comparing cloud service providers, because if you ignore the SLA, your business may spend far more time and effort engineering around failures to prop up an inexpensive cloud.
Recently, I received an e-mail comparing a customer’s internal storage costs to Amazon’s. Of course, Amazon seems to be cheaper based on a pure gigabyte comparison. But it was a flawed analysis because it didn’t include the service level promised, never mind guaranteed.
Without that it’s impossible to compare costs. The enterprise gigabyte comes with a whole host of services around it, while Amazon’s does not. IT groups are terrible at explaining their offerings, so it seems like cost per gigabyte is a good measure, but it’s not. Here’s why.
Amazon’s Elastic Block Store (EBS) service has huge variations in performance that make the service range from good enough to completely useless. It’s so variable that it takes a lot of work to architect around it. This post by Orion and Mdash of Heroku delves into some of the work they did to get acceptable performance from EBS, such as using lots of disks, larger buffers, the right file systems, larger chunk sizes on the RAID, etc. Maybe they toss in a little unicorn powder.
I was struck by how much work they did as storage admins. That costs money in labor, hiring, and training. Add backups, multi-zone, multi-datacenter and the dollar-to-dollar comparison between private storage and public storage begins to be a fairer fight.
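To make the labor argument concrete, here is a minimal sketch of a per-gigabyte comparison that folds in the admin time a customer has to supply. Everything in it is an assumption for illustration: the `StorageOption` class, its field names, and all the prices, hours, and rates are hypothetical, not real Amazon or enterprise figures.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    """A storage offering plus the labor the customer must add (hypothetical model)."""
    name: str
    price_per_gb_month: float      # list price in dollars, illustrative only
    admin_hours_per_month: float   # customer labor needed to make it usable
    hourly_labor_rate: float       # loaded cost of that labor in dollars

    def effective_cost_per_gb(self, gigabytes: float) -> float:
        """List price plus customer labor spread across the stored gigabytes."""
        labor_cost = self.admin_hours_per_month * self.hourly_labor_rate
        return self.price_per_gb_month + labor_cost / gigabytes


# Illustrative numbers only: a cheap raw service that needs a lot of tuning,
# versus a pricier service that includes backups and support.
raw_cloud = StorageOption("raw cloud", 0.10, 40, 75.0)
managed = StorageOption("managed tier", 0.35, 2, 75.0)

for option in (raw_cloud, managed):
    print(option.name, round(option.effective_cost_per_gb(10_000), 3))
```

With these made-up inputs the cheaper list price ends up costing more per gigabyte once labor is counted. The point is not the specific numbers but that the labor term belongs in the comparison at all.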
This is not a knock on EBS, which I use. But I did learn that when you use a cloud service, you always accept a service level, whether explicit or implied, and that service level brings a whole bunch of labor with it. That labor could be avoided if you consider a provider offering a higher service level, instead of cutting them out because their gigabyte pricing is too high.
Getting to better SLAs.
To improve SLAs we need to work on two questions. First, what are the operational expectations I should have in terms of reliability, availability, performance, and security? Unlike Heroku, which owns its own code, an enterprise customer can’t rewrite Oracle or SAP or their existing 1000+ internal applications. Therefore, specific operational expectations are very important to determine the suitability of the infrastructure for a specific application. The Amazon service catalog description for EBS performance doesn’t provide any expectation, and Amazon doesn’t commit to any service level that allows the customer to carry out operations they might need to implement in order to recover from failures.
Below, for example, is a potential way to set expectations: service catalog options. Each offering has different levels and costs. The customer can easily see that Tier 1 Windows includes higher level support, while Tier 4 does not include support. The customer can select and make a trade-off between service level and labor that works for them.
The second question is related to the visibility and tooling for recovery that a customer has. I agree with Christian Reilly that the legalese in SLAs is useless. They are biased in favor of the vendor, the penalties (if any) cannot hope to equal the value lost and they’re operationally useless. It’s this last bit that better SLAs can change, and in doing so make SLAs operationally useful.
For example, EBS provides replication, but I can’t see or manipulate the replica, nor is there a service request a user can make to get a copy. Therefore, operationally speaking, the customer is in charge of making replicas, snapshots, and backups for when EBS fails. And cloud is made by humans. Trucks hit generators. Lightning strikes. And that means the cloud will fail, and the customer may not have any recourse to prevent their data from being lost.
There are other cloud providers who do provide backups, mirroring, snapshots and the tooling. And those services cost more because the cloud provider adds more labor, know-how and resources to deliver that service level. Otherwise, the customer is on the hook for that labor, which adds to that gigabyte cost comparison.
New SLAs mean new types of clouds.
In conclusion, the road to better service level agreements will require the vendor to articulate the service offering and expected performance clearly, in a way that is visible to and actionable by the customer, with clear metrics for recovery. So if the service includes backup, the customer understands the scope, can see when a backup failed, and can recover from failure on their own within the promised window (mean time to recover).
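As a rough sketch of what "visible and actionable" could look like in practice, the snippet below checks a list of backup runs against a promised recovery window. The `BackupRun` record, the `sla_report` function, and the sample data are all hypothetical; no provider exposes exactly this, and a real check would pull the runs from whatever reporting the vendor offers.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class BackupRun:
    """One backup attempt as a customer might see it (hypothetical record)."""
    succeeded: bool
    restore_time: Optional[timedelta]  # measured restore duration, if one was tested


def sla_report(runs: List[BackupRun], promised_recovery: timedelta) -> dict:
    """Summarize failed backups and compare mean restore time with the promise."""
    failed = sum(1 for run in runs if not run.succeeded)
    restores = [run.restore_time for run in runs if run.restore_time is not None]
    # Mean time to recover is only defined when at least one restore was measured.
    mean_restore = sum(restores, timedelta()) / len(restores) if restores else None
    return {
        "failed_backups": failed,
        "mean_time_to_recover": mean_restore,
        "meets_promise": mean_restore is not None and mean_restore <= promised_recovery,
    }


# Made-up sample: two tested restores and one failed backup.
runs = [
    BackupRun(True, timedelta(minutes=30)),
    BackupRun(False, None),
    BackupRun(True, timedelta(minutes=90)),
]
report = sla_report(runs, promised_recovery=timedelta(hours=1))
print(report)
```

The design choice worth noting is that the promise is expressed as a number the customer can test against, rather than as legal language.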
New SLAs will likely lead to two types of clouds, or at least make this bifurcation of clouds more visible. One cloud will be for apps designed for failure, scale-out and mobility: perfect for single-app startups coming out of Silicon Valley and some new field applications. That makes sense, because startups have more labor than money.
The other is built to bring existing enterprise applications into a more cloud-like operating model, meaning transitioning existing applications and workloads to be more on-demand, elastic and pay-per-consumption, using either a private cloud or an enterprise cloud provider. I call them “city clouds” because they have to work within existing infrastructure, follow rules and customs, and are constrained by other applications. And thanks to better SLAs, customers can know what they are choosing and price that into their estimates.
Rodrigo Flores is a cloud enterprise architect at Cisco Systems and the former founder and CTO of newScale, which was acquired by Cisco in April 2011.