
Amazon and Rackspace must be wishing they had live migration right now

Two major cloud providers — Amazon Web Services and Rackspace — had to scramble this weekend to reboot a good chunk of their customers’ compute instances due to a mysterious but apparently critical Xen hypervisor issue.

As it became evident that Xen was at the heart of the matter, cloud experts (and VMware marketing people) quickly weighed in to say that live migration, which allows virtual machines (VMs) to be moved between physical machines without being shut down, would have mitigated much if not all of this trauma. With live migration, the cloud provider could move the guest VM onto another host and then reboot the old host. But enabling that feature is easier said than done, or everyone would do it.
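To make the move-then-reboot idea concrete, here is a minimal Python sketch of a rolling host patch. It is a toy simulation only: the `Host` class, the `rolling_patch` function and the slot-based capacity model are hypothetical names invented for illustration, and nothing here reflects how any provider actually schedules maintenance. Each host is drained onto whichever other host has room and is then marked as patched; VMs with nowhere to go are counted as rebooted.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    capacity: int                      # max VMs this host can run (hypothetical unit)
    vms: list = field(default_factory=list)
    patched: bool = False

    def free_slots(self) -> int:
        return self.capacity - len(self.vms)


def rolling_patch(hosts):
    """Patch every host, draining each one via simulated live migration first.

    Returns the VMs that had to be rebooted because no other host had room.
    """
    rebooted = []
    for host in hosts:
        for vm in list(host.vms):
            # Prefer already-patched targets so a VM is less likely to move twice.
            targets = sorted(
                (h for h in hosts if h is not host and h.free_slots() > 0),
                key=lambda h: not h.patched,
            )
            if targets:
                host.vms.remove(vm)
                targets[0].vms.append(vm)   # stands in for a live migration
            else:
                rebooted.append(vm)         # no room elsewhere: VM reboots with its host
        host.patched = True                 # stands in for patching and rebooting the host
    return rebooted


if __name__ == "__main__":
    fleet = [Host("a", 2, ["vm1", "vm2"]), Host("b", 2, ["vm3", "vm4"]), Host("spare", 2)]
    print("rebooted with a spare host:", rolling_patch(fleet))

    full = [Host("a", 2, ["vm1", "vm2"]), Host("b", 2, ["vm3", "vm4"])]
    print("rebooted with no spare capacity:", rolling_patch(full))
```

In this toy model a single empty host is enough to patch the whole fleet without rebooting a guest, while a completely full fleet has to reboot every VM. Real deployments also have to account for migration time, network bandwidth and workload sensitivity, none of which is modeled here.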

As of now, however, just VMware and Google do so. Google built its live migration and related maintenance capabilities from the ground up. VMware can move VMs between servers running in vSphere environments and will add that capability to vCloud Air over time. It’s somewhat easier for VMware to do this because it controls both sides of the transaction. The Amazon and Rackspace public clouds rely on a legacy (read: old) hypervisor in Xen and use older cloud orchestration technology, which means they have to do the requisite patching, said Gigaom Analyst MSV Janakiram.

AWS could be working on live migration, perhaps for an AWS re:Invent reveal in November. But if it is, we wouldn’t expect the company to say so, and it did not. Asked about any plans for live migration, a spokeswoman said via email that the company continuously maintains its cloud to avoid problems but that, in any case, live migration per se is no silver bullet. She wrote:

Even in this case, we were able to do the vast majority of the maintenance without any customer impact. There will sometimes be cases where the specifics require that we do reboots regardless of the various maintenance capabilities we have — and we have many. There is no single capability, including live migration, that can guarantee zero customer impact. In this case, there was no way for us to avoid rebooting less than 10% of the fleet. Thus far, on the current maintenance activity, we have completed most of the instance reboots and have seen little customer impact.

(The emphasis is mine.)

Rackspace did not respond to requests for comment for this story, but I suspect we have not heard the end of this incident or of the need for more seamless workload transfers within a customer’s cloud of choice and perhaps (dare we hope?) even between clouds from multiple vendors.


Feature photo courtesy of Flickr user Phil Roeder

10 Responses to “Amazon and Rackspace must be wishing they had live migration right now”

  1. What about leveraging a network technology to ensure the data is moved from one host to another without the customer having any sort of interruption? High speed InfiniBand makes this possible and is currently in production over at ProfitBricks. Check it out!

  2. Jimbo Jones

    Erm…..I think this article is probably slightly mistaken. I bet they DO have live migration (Xen has supported it for a long time), however, they had to patch by 1st Oct and live migrating millions of VMs would have 1. Taken a very long time and a lot of manpower and 2. Required masses and masses of network bandwidth which possibly isn’t available.

  3. Freecounty

    Migration assumes a destination that would have been already updated. Even with “live migration”, the reboot would still happen if the hypervisor update (Xen) affects all hosts while there aren’t enough idle updated hosts for the number of guest machines.

  4. oh…VM live migration only works for the most trivial workloads, for mission critical applications that matter it just doesn’t work without crazy side effects.

    So, better to failover properly and forget about live migration

  5. That was my first thought when I heard of the AWS and Rackspace announcements. But if you sit back and think about it, it is not a big deal. Cloud computing today has moved forward from VPSes and VMs, and most customers in the cloud don’t have an architecture that is reliant on one server. Live migration can make you lazy sometimes – Enterprise IT still has architectures where some servers cannot have downtime. I think in the long run this sort of architecture in the public cloud today is better for everybody.

    • The widespread negative reaction from actual AWS and Rackspace customers around the reboot demonstrates how few apps are architected, coded and tested to withstand a “chaos monkey” reboot like this.

      • That’s because the majority of so-called “software engineers” won’t be able to engineer their way out of a paper bag if the paper bag was made from a material slightly different from the one they designed their method for.