
The VMotion Myth


Among the many innovations that virtualization has brought to the data center is server mobility, or the ability to live-migrate virtual machines (VMs) across physical servers. With it comes a marketing story that dynamically moving VMs inside a single data center or between two data centers is a seamless process. While at some point that will undoubtedly be true, it’s far from an operational reality today. In the meantime, there are numerous opportunities for startups to offer solutions that will help make such seamlessness a reality.

Currently, moving a VM from one physical machine to another has two important constraints. First, both machines must share the same storage back end, typically a Fibre Channel/iSCSI SAN or network-attached storage. Second, the physical machines must reside in the same VLAN or subnet. This means that inside a single data center, one can only move a VM across a relatively small number of physical machines. Not exactly what the marketing guys would have you believe.
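
To make those two constraints concrete, here is a minimal sketch in Python of the checks a scheduler would have to make before a live migration. The Host model and the can_live_migrate helper are hypothetical, purely illustrative constructs, not any actual VMware API.

    from dataclasses import dataclass, field

    @dataclass
    class Host:
        """Hypothetical model of a physical host, for illustration only."""
        name: str
        datastores: set = field(default_factory=set)  # storage back ends visible to this host
        vlans: set = field(default_factory=set)       # VLANs/subnets available to this host

    def can_live_migrate(src: Host, dst: Host, vm_datastore: str, vm_vlan: int) -> bool:
        # Constraint 1: both hosts must see the same storage back end.
        shares_storage = vm_datastore in src.datastores and vm_datastore in dst.datastores
        # Constraint 2: the destination must sit on the VM's VLAN/subnet,
        # otherwise the guest's IP address stops working after the move.
        shares_network = vm_vlan in src.vlans and vm_vlan in dst.vlans
        return shares_storage and shares_network

    # Only hosts that share both the SAN LUN and VLAN 10 are valid targets.
    a = Host("esx-a", {"san-lun-01"}, {10, 20})
    b = Host("esx-b", {"san-lun-01"}, {10})
    c = Host("esx-c", {"nas-vol-07"}, {10})
    print(can_live_migrate(a, b, "san-lun-01", 10))  # True
    print(can_live_migrate(a, c, "san-lun-01", 10))  # False: no shared storage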

While this might be a small inconvenience to an enterprise data center, think about it from the perspective of a cloud provider. Their entire reason for existence is to maximize revenue earned from every dollar spent on servers. If the network limits their server utilization by constraining VM mobility, then it’s costing them revenue. Maybe they should try asking for that money back from their current networking vendor?

Many networking vendors talk about the flattening of the data center network as a cure-all, and would advise simply building a very large VLAN. But such an approach has a host of problems that have been around for years and are why routers are deployed in the first place: limiting multicast traffic, Spanning Tree’s issues with multipathing, and of course the fact that network operations staff use VLANs for very real purposes like segmenting traffic for security, compliance, PCI, and so on. These and a number of other limitations of current data center networks are on the radar of vendors and standards bodies; meanwhile, the research community has responded with lots of papers with funny names like VL2 (PDF) and PortLand (PDF), among others. What we need now are startups to come up with some creative, and practical, solutions to these problems. For if networking continues to play second fiddle to compute and storage, the cloud vision will never be fully realized.

And while server mobility inside a single data center is tough, inter-datacenter and data center-to-the-cloud server mobility is even tougher. Cisco and VMware have published (PDF) papers about it a few times, but the solutions they’re proposing seem more like hero experiments than practical solutions at this point.

For long-distance applications, storage becomes the big problem, which EMC highlighted just last month with the release of VPLEX (GigaOM Pro, sub req’d). It turns out that replicating VM data across the WAN is doable, but really expensive and even more bandwidth-hungry. Also, if you’re using Fibre Channel there are distance limitations due to the FC protocol. Oh, and with the various flavors of storage over IP, you’d better not have any packet loss. On the other hand, you could choose to move just the VM state while keeping storage in the original data center, but that will impact application performance. In other words, when all is said and done, getting this to work is extremely complicated, and probably only feasible if money is no object.
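
To put rough numbers on why WAN replication gets expensive, here is a back-of-the-envelope sketch; the datastore size, link speeds, and 70 percent efficiency figure are illustrative assumptions, not measurements.

    def replication_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
        """Hours to push data_gb across a WAN link running at the given efficiency."""
        effective_mbps = link_mbps * efficiency
        seconds = (data_gb * 8 * 1000) / effective_mbps  # GB -> gigabits -> megabits
        return seconds / 3600

    # Initial sync of a 2 TB datastore over a 100 Mbps WAN link at ~70% efficiency:
    print(f"{replication_hours(2048, 100):.0f} hours")   # roughly 65 hours
    # The same datastore over a 1 Gbps link:
    print(f"{replication_hours(2048, 1000):.1f} hours")  # roughly 6.5 hours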

The typical use cases cited for VMware’s long-distance (LD) VMotion include disaster recovery and avoidance, workload balancing, and so on. The concept of moving from disaster recovery to disaster avoidance is a compelling shift and can add layers of reliability on top of existing features from VMware, including High Availability and Fault Tolerance. There seems to be a clear need for inter-data center and private-to-public cloud migration. And those use cases, which I believe are set to take off, should be the real call to arms to make such a capability operationally efficient and much less expensive.

Server mobility is a powerful tool in the modern data center, but it currently has numerous limitations. In these limitations I see opportunity for startups, especially when it comes to fixing the networking issues and helping to decrease the storage costs. Let me know if you do, too.

To hear more about server mobility and similar topics, attend Structure on June 23 & 24 in San Francisco.

Alex Benik is a principal at Battery Ventures.

25 Responses to “The VMotion Myth”

  1. For engineers like me who rarely have direct exposure to datacenters, provisioning, and that side of IT, would someone explain what kind of applications are hosted on these?

    I’ve listed some simple questions and described a pretty common deployment scenario in the Java/JEE world. I’d greatly appreciate answers (no flaming pls, we are noobs):

    1) The app running on the app server uses 80-90% CPU for 8 hrs a day
    2) The app is really a cluster of let’s say 12 processes running on 4 machines. App server already does FT/ HA and all that stuff. That’s why it’s a cluster
    3) Each process has been allocated and is using 3-4G of RAM
    4) Yes, these are JVMs
    5) 1G switch

    Given these basic stats,
    1) Would you put this setup on VMs? (Not to be confused with JVMs)
    2) Obviously this vmotion has to use CPU. Someone has to be syncing the deltas
    3) Again, it is also using the network and bandwidth
    4) Are you saying there is no performance impact on the application? No context switching, cache pollution, interruption, bus contention

    Or, is there a little elf sitting inside who can magically make this happen without impacting performance and guarantee 0 downtime?
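
    For scale, a quick back-of-the-envelope on questions 2 and 3, using only the numbers above; the 80% link efficiency is just an assumption, and live migration typically makes several pre-copy passes over memory that keeps getting dirtied, so the real traffic is higher:

        def copy_seconds(ram_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
            """Seconds to copy ram_gb of memory over a link at the given efficiency."""
            return (ram_gb * 8) / (link_gbps * efficiency)

        # One 4 GB JVM process over the 1G switch mentioned above, at an assumed 80% efficiency:
        print(f"{copy_seconds(4, 1):.0f} s per memory copy pass")  # ~40 s of real network and CPU load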

    Explicitly clustered apps like app servers, grids, NoSQL software, and other such apps that people have been using for years provide many knobs to suit the user’s combination of performance/FT/HA, etc. They do not hide this from users/developers.

    Am I right or am I right? Perhaps someone can clarify such things for us? (without just asking us to RTFM pls)


  2. Alex Benik

    Thanks for the spirited feedback. Many of the comments focused on my use of the term myth in the title. I certainly agree that VMotion isn’t a myth. It is an extremely valuable tool used by many customers. The myth I was alluding to is that these VMs are completely mobile without constraints. I just wanted to point out some of the limitations and let it be known that I think there are interesting businesses to be built in helping enterprises and cloud providers extend the envelope of the VMotion use case.

  3. This must be the same guy testing tier 1 apps on a 1998 laptop using VMware Workstation and calling the vendor out for poor performance. Seems like another scare tactic to me.

  4. Glynn Seymour

    I’m assuming he is referring to the class C subnet used for cluster interconnect and passing vMotion traffic.

    Having a choice of only 253 possible physical host destinations is hardly limiting. Having some of those nodes close down when not needed and power up when demand dictates, and having workloads move SEAMLESSLY between them during this process, is as near to magic as I’ve seen in my time.

    Not using shared storage? You should probably know why you aren’t and what limitations this imposes. These days, I can see only a few reasons to do so – server consolidation to a single node with DAS would increase your risk IMHO unless you’re binning some archaic hardware to do so and remain in support.

    As for the ‘myth’ of vMotion – guess I must have been imagining it for years.

  5. I just read this blog and thought “WOW! He has no clue what he is talking about!”

    As a VMware Certified Professional (VCP) for the past 4 years, I can say you are completely wrong here. Your “myth” is a result of a complete lack of understanding of networking as a whole. I would strongly suggest you read the most important (IMO) whitepaper from VMware on this subject. This whitepaper is the foundation of a resilient virtual infrastructure:

    VMware ESX Server 3: 802.1Q VLAN Solutions

  6. James Streit

    Don’t forget about the 3rd constraint … CPU. If you want to do a vMotion of a machine that is in a running state, even if you have the shared storage and the network taken care of, you can’t move a machine from an Intel platform to an AMD platform.

    • Mike G

      Very nice insight in your comments. Thoughtful and helpful. Too bad you don’t know the English language. I am afraid that you don’t know what you’re talking about.

  7. I have a customer with more than 240 hosts in their datacenter, and with trunked ports as Scott Lowe described plus svMotion as described in the comments here, I am able to move a running VM to any of them. Hardly a “relatively small number of physical machines” as far as I’m concerned.

  8. “First, both machines must share the same storage back end, typically a Fibre Channel/iSCSI SAN or network-attached storage.” <– This is a half-truth. Yes, both physical machines must see the same storage back end, but that back end can span multiple, different physical SANs. VMware has Storage vMotion, which allows a live copy of a VM from one physical array to a completely different physical array with no downtime.

    I’m rather curious where you’ve received all this information. VMotion is not a myth.
    “Second, the physical machines must reside in the same VLAN or subnet.” <– Not true; the physical machines can all reside on different VLANs, but best practice says to isolate those VLANs so that no other traffic supersedes moving VMs around. It’s also a good security measure.

    As for the matter of disaster recovery, there are solutions for that in VMware’s portfolio as well as from several other vendors to mirror the current environment to anywhere, but yes, cost could be a huge factor unless done correctly. Technologies such as deduplication from NetApp or EMC’s Celerra can reduce the amount of SAN space needed and the overall amount of data that needs to be transferred over the network.

  9. Andrew Miller

    Hmm….I agree that there are boundaries around vMotion design, but practically speaking they’re high enough to leave a lot of flexibility (especially in enterprise or even small business datacenters).

    Speaking as an architect/engineer who just implemented a full-on “VMware datacenter in a box” last week (storage + servers + vSphere), vMotion is definitely a reality even in shops with (4) ESX hosts — we had it up and going on day 2 actually (a good bit of day 1 was whiteboard time around VLANs and IP range layout….design work that really should be done in any virtualized environment).

    The wonderful thing about vMotion (and DRS really, i.e. dynamic vMotion) is that you can stop thinking as much about individual physical hardware and just consider it to be a “pool of resources”….if you need more resources, add another physical box into the pool.

    I agree there are limitations (show me tech without it) but I have many customers who are using vMotion/DRS with fantastic results…..

  10. vMotion has a goal of keeping the VM powered on and serving requests while it is transitioned between physical hosts. I am not sure what route you are wanting to take if the same IP network doesn’t exist for that VM on the other side of the move?

    Are you looking to have the IP stack of the VM dynamically adjust and re-register (thinking dynamic DNS with extremely low TTLs) so that the requests can be processed after the move? If that is the case, why not just spawn more instances of a stateless VM in the cloud and use network load balancers to deal with new TCP or UDP requests for the services that the particular VM cluster processes?

    So for clarification, do you assume that the vMotion myth applies to VMs being moved in a powered ON or OFF state?

    As an aside, I am all about SRV records being used more with low TTLs.. still find it hard to believe they aren’t more widely adopted with standard internet protocols.
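
    For what it’s worth, a minimal sketch of that low-TTL SRV lookup on the client side, assuming the third-party dnspython package and a placeholder service name:

        import dns.resolver  # third-party package: dnspython

        def pick_endpoint(service: str):
            """Resolve an SRV record and return (host, port) for the best-priority target."""
            answers = dns.resolver.resolve(service, "SRV")
            # Lower priority wins; weight spreads load among targets of equal priority.
            best = min(answers, key=lambda r: (r.priority, -r.weight))
            return str(best.target).rstrip("."), best.port

        # Placeholder service name; with a short TTL, clients re-resolve soon after a VM moves.
        host, port = pick_endpoint("_myapp._tcp.example.com")
        print(host, port)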

  11. I think you are overestimating the limitation of VMotion within the same data center. All you need to do is make sure the VLAN you need for the virtual machine is trunked to all ESX hosts in your cluster and configure your virtual switch/distributed virtual switch/Nexus 1000V accordingly. This is really not a big deal and trivial to set up and maintain. I do this every day. As a matter of fact, while I was writing this response a few of my production virtual machines moved around seamlessly in my DRS cluster.

    The bigger issue is across data centers. The problem is not so much a VMware issue but a general networking problem/best practice and also a real hard storage problem as you have stated. The network problem has recently been sorta solved by Cisco with OTV. The problem is truly a spanning tree issue. With L2/L3 being at a distance you can cause yourself some real problems, read: loops or broadcast storms.

    The storage problem still has some kinks to be worked out, although EMC’s VPLEX looks very interesting and quite promising. I hope to get a demo from EMC sometime this year to see how well this piece of technology actually works.

  12. Just to add a vendor perspective. We offer full FTP over SSL access for all clients for free. You can FTP in and out drive images using regular FTP clients. Migrating a physical server to the cloud is simply a matter of imaging the drive and FTPing it into our cloud. Likewise, migrating out means FTPing your drive images to another provider. The idea of physical to virtual is a reality and technologically feasible today.

    We thought this was the easiest and most straightforward approach for most users. We don’t charge for incoming bandwidth, making it easy to transfer in data, trial our servers, and simply delete your drives if you aren’t happy.
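
    As a rough illustration of the client side of that workflow, here is a sketch using Python’s standard ftplib; the hostname, credentials, and image name are placeholders:

        from ftplib import FTP_TLS

        # Placeholder endpoint and credentials; substitute the provider's real values.
        ftps = FTP_TLS("ftp.example-cloud.com")
        ftps.login("customer", "secret")
        ftps.prot_p()  # encrypt the data connection as well as the control channel

        # Stream a raw drive image up in binary mode.
        with open("server-root.img", "rb") as image:
            ftps.storbinary("STOR server-root.img", image)
        ftps.quit()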