There is profound disagreement about whether live migration would have averted last month’s reboots of the Amazon Web Services, Rackspace, and IBM SoftLayer clouds. For the record, Amazon has said that it would not have done so. In an unscientific poll of a dozen or so experts who use and/or study AWS, some agreed with that assertion and others did not. But there is some consensus that the public cloud leader is working on the technology for its massive cloud either way.
As background, live migration, a feature pioneered by VMware in vMotion, lets you move a virtual machine (VM) from one host to another without having to shut it down. That’s obviously a critical feature for enterprise accounts, the likes of which are being wooed by Amazon, by Google, by Microsoft and by VMware itself for their respective clouds. The traditional take is that VMware can do this because it controls the in-data-center technology with its vSphere dominance, and going forward it plans to offer live migration between in-house vSphere environments and outside vCloud Air implementations. But it’s far harder for a massively scaled cloud to offer a similar capability.
As for the disagreement about live migration and the recent reboots, one camp holds that live migration could have helped. If AWS moves one guest VM to another host, patches and reboots the old host, and returns the guest VM to the now-healthy host, problem solved! After all, doesn’t AWS have a prodigious supply of hosts at its beck and call?
Others said this is far too simplistic. They agree that Amazon, in theory, could have done this but that this process would have been more time- and resource-intensive than the reboot itself. Live migration is nice, but no panacea, they argue.
Jesse Proudman, founder and CTO of Blue Box Group, a Seattle-based cloud provider, said both AWS and Rackspace use local disk for virtual machine (VM) data and that is an issue. He said via email:
“With the data on-chassis, live migration is possible but it takes time: the data for each VM needs to be sent across the network. Evacuating a given host, particularly one at capacity can take hours. To live migrate an entire availability zone requires available capacity to migrate the VMs to. Cloud providers operate availability zones at capacities above 50% requiring migrations to be staggered host by host amongst the free capacity that’s available. When a provider must reboot an entire fleet in a short time window (say for a kernel breakout) and it takes a few hours per host and you can only migrate a subset of your fleet at a time, this results in mathematical equation that simply won’t balance.”
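To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is not Proudman’s math or anything AWS has published. The function name, the fleet size, the utilization figure and the three-hours-per-host number are all hypothetical, and the model assumes hosts are either completely full or completely empty and that each wave of migrations takes as long as evacuating one host.

```python
import math


def fleet_evacuation_hours(total_hosts, utilization_pct, hours_per_host):
    """Estimate hours needed to evacuate every occupied host, wave by wave.

    Simplifying assumptions (all hypothetical): hosts are either full or
    empty, each wave can evacuate only as many full hosts as there are
    empty hosts to receive their VMs, and a freshly patched host becomes
    spare capacity for the next wave.
    """
    if not 0 < utilization_pct < 100:
        raise ValueError("utilization_pct must be between 0 and 100")
    # Integer math avoids floating-point surprises when splitting the fleet.
    full_hosts = math.ceil(total_hosts * utilization_pct / 100)
    spare_hosts = total_hosts - full_hosts
    if spare_hosts < 1:
        raise ValueError("no spare capacity to migrate into")
    # Number of staggered waves needed to cycle through every full host.
    waves = math.ceil(full_hosts / spare_hosts)
    return waves * hours_per_host


# Hypothetical fleet: 1,000 hosts, 80% full, 3 hours to evacuate a host.
print(fleet_evacuation_hours(1000, 80, 3))
```

Under those made-up numbers, the 800 occupied hosts have to be cycled through 200 spare ones in four waves. Push utilization to 90 percent and the same fleet needs nine waves, which is the sort of squeeze Proudman describes when the patch window is short.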
Google setting the bar on live migration
Having said all that, even people who think live migration would have been of limited use to AWS (or Rackspace or IBM SoftLayer) last month pretty much agree that AWS is working on it. Why? Because Google has it.
When Google announced plans for Google Compute Engine live migration last year (demonstrating it in March), it was pretty clear that AWS would have to respond. A year ago, Gartner analyst Lydia Leong blogged about how critical it is for large cloud providers to bring live migration to market and how she hoped Google’s move would raise the bar. She wrote:
Not only will Google’s addition of migration help data center maintenance, but more importantly, it will mitigate downtime related to host maintenance. Although AWS, for instance, tries to minimize host maintenance in order to avoid instance downtime or reboots, host maintenance is necessary — and it’s highly useful to have a technology that allows you to host maintenance without downtime for the instances, because this encourages you not to delay host maintenance (since you want to update the underlying host OS, hypervisor, etc.).
And she added:
VMware-based providers almost always do live migration for host maintenance, since it’s one of the core compelling features of VMware. But AWS, and many competitors that model themselves after AWS, don’t. I hope that Google’s decision to add live migration into GCE pushes the rest of the market — and specifically AWS, which today generally sets the bar for customer expectations — into doing the same, because it’s a highly useful infrastructure resilience feature, and it’s important to customers.
I’d be willing to bet AWS has been prepping its live migration response for some time, not that it would tell me. The big question is whether it will be ready to announce something at AWS re:Invent next month. At that same show two years ago, Netflix CEO Reed Hastings put live migration at the top of his wish list for AWS features. His words carry weight since Netflix is one of the largest users of AWS, and AWS prides itself on being responsive to customer demand. So let’s see what happens next month in Las Vegas.
Last week, Google made some pretty big claims about how Live Migration helped it avoid issues from the Shellshock bug for its Google for Work users. Per a support email sent to customers and partners:
[…] Our Live Migration technology allowed us to transparently update and secure our host systems against ShellShock with zero fuss or fanfare, and we are building technology to make future ground-up infrastructure refreshes like this non-disruptive as well.
There’s a little bit of cognitive dissonance here since many of the Google applications under the Google for Work brand were hit by snafus on October 8 when many Google storage, Gmail, Hangouts and Postini users could not access their services. I’ve reached out to Google for clarification on this and will update this story when it’s forthcoming.
… and it’s the Structure Show!
On this week’s show, Derrick Harris and I gas on about Hewlett-Packard’s decision to morph from better-together One HP to even-more-better-apart Two HPs and what that might mean for a combined EMC and HP (or maybe even a combined Rackspace and HP?). The general idea being that by offloading PCs and printers, the new HP Enterprise unit can merge with or acquire more enterprise-focused stuff. And we all know that EMC Chairman Joe Tucci plans to retire. Some day. Or so rumor has it.
And this week’s guests, Ann and Bobby Johnson, talk with Derrick about Interana, a startup they founded with Facebook veteran Lior Abraham to bring Facebook-style analytics to the rest of the world. Abraham built Facebook’s SCUBA data-analysis system, which was built for use by the company’s engineers to analyze server performance but has since been adopted by most of the masses at Facebook.
Hosts: Barb Darrow and Derrick Harris