
Is live migration coming to Amazon Web Services? Smart money says yes.

There is profound disagreement about whether live migration would have averted last month’s reboots of the Amazon Web Services, Rackspace, and IBM SoftLayer clouds. For the record, Amazon has said that it would not have. In an unscientific poll of a dozen or so experts who use or study AWS, some agreed with that assertion and others did not. But there is some consensus that the public cloud leader is working on the technology for its massive cloud either way.

As background, live migration, a feature pioneered by VMware in vMotion, lets you move a running virtual machine (VM) from one host to another without having to shut it down. That’s obviously a critical feature for enterprise accounts, the likes of which are being wooed by Amazon, Google, Microsoft and VMware itself for their respective clouds. The traditional take is that VMware can do this because it controls the in-data-center technology through its vSphere dominance; going forward, it plans to offer live migration between in-house vSphere environments and outside vCloud Air implementations. But it’s far harder for a massively scaled cloud to offer similar capability.

As for the disagreement about live migration and the recent reboots, one camp holds that live migration could have helped. If AWS moves a guest VM to another host, patches and reboots the old host, and returns the guest VM to the now-healthy host, problem solved! After all, doesn’t AWS have a prodigious supply of hosts at its beck and call?

Others said this is far too simplistic. They agree that Amazon could, in theory, have done this, but say the process would have been more time- and resource-intensive than the reboot itself. Live migration is nice but no panacea, they argue.

Jesse Proudman, founder and CTO of Blue Box Group, a Seattle-based cloud provider, said both AWS and Rackspace use local disk for VM data, and that is an issue. He said via email:

“With the data on-chassis, live migration is possible but it takes time: the data for each VM needs to be sent across the network. Evacuating a given host, particularly one at capacity can take hours. To live migrate an entire availability zone requires available capacity to migrate the VMs to. Cloud providers operate availability zones at capacities above 50% requiring migrations to be staggered host by host amongst the free capacity that’s available. When a provider must reboot an entire fleet in a short time window (say for a kernel breakout) and it takes a few hours per host and you can only migrate a subset of your fleet at a time, this results in mathematical equation that simply won’t balance.”
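To see why that equation struggles to balance, here is a rough back-of-the-envelope sketch in Python. It is not anything AWS, Rackspace or Blue Box has published. The function name, the fleet size and the hours-per-host figure are all made up for illustration, and the model assumes identical hosts, evenly spread load, and migrations that can only land in capacity that is free at that moment.

```python
def fleet_evacuation_hours(total_hosts, utilization_pct, hours_per_host):
    """Estimate the wall-clock hours needed to live-migrate every host once.

    Hypothetical model (an assumption, not any provider's real process):
    all hosts are the same size, every host runs at `utilization_pct`
    percent, and hosts are evacuated in waves.
    """
    if not 0 <= utilization_pct < 100:
        raise ValueError("utilization_pct must be between 0 and 99")

    # Evacuating k hosts at once needs room for their VMs on the other
    # (total_hosts - k) hosts. With uniform load that works out to
    # k <= total_hosts * (1 - utilization), computed in integer percent
    # to avoid floating-point rounding surprises.
    hosts_per_wave = total_hosts * (100 - utilization_pct) // 100
    if hosts_per_wave == 0:
        raise ValueError("no free capacity to migrate into")

    # Ceiling division: a partial final wave still costs a full wave.
    waves = -(-total_hosts // hosts_per_wave)
    return waves * hours_per_host


# Illustrative numbers only: 1,000 hosts, 3 hours to evacuate each host.
for pct in (50, 60, 80, 90):
    print(pct, fleet_evacuation_hours(1000, pct, 3))
```

Under those assumptions, the busier the fleet, the fewer hosts can be drained at once, so the total time climbs quickly as utilization rises. Real fleets are messier (mixed instance sizes, network limits, placement rules), so treat the output as a shape, not a forecast.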

 

Google setting the bar on live migration

Having said all that, even people who think live migration would have been of limited use to AWS (or Rackspace or IBM SoftLayer) last month pretty much agree that AWS is working on it. Why? Because Google has it.

When Google announced plans for Google Compute Engine live migration last year (demonstrating it in March), it was pretty clear that AWS would have to respond. A year ago, Gartner analyst Lydia Leong blogged about how critical it is for large cloud providers to bring live migration to market and how she hoped Google’s move would raise the bar. She wrote:

Not only will Google’s addition of migration help data center maintenance, but more importantly, it will mitigate downtime related to host maintenance. Although AWS, for instance, tries to minimize host maintenance in order to avoid instance downtime or reboots, host maintenance is necessary — and it’s highly useful to have a technology that allows you to host maintenance without downtime for the instances, because this encourages you not to delay host maintenance (since you want to update the underlying host OS, hypervisor, etc.).

And she added:

VMware-based providers almost always do live migration for host maintenance, since it’s one of the core compelling features of VMware. But AWS, and many competitors that model themselves after AWS, don’t. I hope that Google’s decision to add live migration into GCE pushes the rest of the market — and specifically AWS, which today generally sets the bar for customer expectations — into doing the same, because it’s a highly useful infrastructure resilience feature, and it’s important to customers.

I’d be willing to bet AWS has been prepping its live migration response for some time, not that it would tell me. The big question is whether it will be ready to announce something at AWS re:Invent next month. At that same show two years ago, Netflix CEO Reed Hastings put live migration at the top of his wish list for AWS features. His words carry weight since Netflix is one of the largest users of AWS, and AWS prides itself on being responsive to customer demand. So let’s see what happens next month in Las Vegas.

Last week, Google made some pretty big claims about how live migration helped it avoid issues from the Shellshock bug for its Google for Work users. Per a support email sent to customers and partners:

[…] Our Live Migration technology allowed us to transparently update and secure our host systems against ShellShock with zero fuss or fanfare, and we are building technology to make future ground-up infrastructure refreshes like this non-disruptive as well.

There’s a little bit of cognitive dissonance here, since many of the Google applications under the Google for Work brand were hit by snafus on October 8, when many Google storage, Gmail, Hangouts and Postini users could not access their services. I’ve reached out to Google for clarification on this and will update this story when it’s forthcoming.

 

… and it’s the Structure Show!

 

On this week’s show, Derrick Harris and I gas on about Hewlett-Packard’s decision to morph from better-together One HP to even-more-better-apart Two HPs and what that might mean for a combined EMC and HP (or maybe even a combined Rackspace and HP?). The general idea is that by offloading PCs and printers, the new HP Enterprise unit can merge with or acquire more enterprise-focused stuff. And we all know that EMC Chairman Joe Tucci plans to retire. Some day. Or so rumor has it.


And this week’s guests, Ann and Bobby Johnson, talk with Derrick about Interana, a startup they founded with Facebook veteran Lior Abraham to bring Facebook-style analytics to the rest of the world. Abraham built Facebook’s SCUBA data-analysis system, which was designed for the company’s engineers to analyze server performance but has since been adopted by most of the masses at Facebook.

 

 

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris


6 Responses to “Is live migration coming to Amazon Web Services? Smart money says yes.”

  1. Cloud Insider

    Check @blueboxjesse’s article on why Live Migration does not work well in these scenarios. Hot patching is the better alternative here. AWS has hot patching on most servers. The few that didn’t have it (and required a reboot) are being upgraded to support hot patching.

  2. bmccallion

    When I hear about things like Live Migration, I’m always glad to hear about the technology. And yet I perceive a marked difference between being customer centric and serving impulses that seemingly enable customers to distance themselves from the work to be done. Live migration strikes an odd resonance. It seems like a convenience, yet some of the opportunity in Cloud Adoption is in doing the work of automating instance builds, and learning how to write applications that are self-healing, and where the application code adapts to changes in the infrastructure. I really like things simple, and my approach is to build with web services first, and only configure and run what I can’t find in the Cloud. While perhaps not obviously related, I appreciate Reserved Instances as opposed to Long Running instances where the discounts kick in if an instance keeps running long enough. I certainly understand why people grouse about the analysis and decisions to be made with respect to high availability and yes, Reserved Instances. Yet in each case doing the work empowers the customer. One of the significant departures between Cloud and Legacy vendors is the degree of control and empowerment available to customers. Customers may not be fully comfortable taking on responsibility for the availability of their applications, yet I hear many lament how Cloud somehow replaces the rock solid 5 9’s of their two data centers separated by the Hudson with an inferior SLA. As customers of Cloud my suggestion is that we welcome opportunities to take back control of our applications and infrastructure. I’m sure Live Migration is amazing but for customers “doing it right” it may be useful. For customers at the scale of Netflix, the need is probably there. But Netflix has done all the hard work along the way and is stronger for it.