
Wanted: Virtualization Engineer, Referee Exp. Pref.


The virtualization of systems allows for efficient use of server resources and is clearly a trend that many enterprises are embracing. Systems engineers see virtualization as the next generation of tools that can help scale their servers, while network engineers see the virtualization trend headed in their direction as well. Unfortunately, it seems that server virtualization also helps foster trench warfare between the two.

I found myself witness to one small skirmish in this battle today, when I met with a startup looking for funding. The startup is building enterprise services, and for its next generation plans to make heavy use of XenSource’s XenMotion functionality to manage virtual machines on about 50 physical servers. This functionality, which is similar to that of VMware’s VMotion, promises to seamlessly move a virtual machine from one physical server to another. The startup’s service product could be running in one virtual machine on a server and if the server receives too much load or has a failure, the XenMotion functionality could move the virtual machine to another server without resulting in any downtime. For an enterprise services startup, avoiding downtime is a good idea.

I asked some questions about the network and systems architecture and found that the systems engineers had made the assumption that in the new service, any virtual machine could be allocated to any physical server. The network engineers, unfortunately, had not taken this into account. Based on the physical network topology — a classic three-tier architecture — the network engineers had set up firewall rules and access-control lists to appropriately protect the infrastructure. For example, not every server could be accessed from the Internet and only certain physical servers had permission to mount storage area network resources. If using XenMotion meant every server was expected to house any virtual machine at a moment’s notice, these were clearly issues that needed to be resolved.

The systems engineers’ expectation of being able to move any virtual machine to any physical server in the infrastructure meant a complete redesign of the network topology was required. And that is when the skirmish ensued. The systems engineers insisted that the network topology be set up to allow XenMotion to work seamlessly. The network engineers argued that their network topology was necessary for scalability and security. As far as I was concerned, they were both right, so before continuing my due diligence on their business, I sent them off to settle their skirmish amongst themselves.

But it got me thinking: Has server virtualization added an abstraction layer that further separates systems engineers and network engineers from the physical reality of their environments? Do we need a new engineer — a virtualization engineer — who understands how virtual machines are allocated across physical servers and networks, to act as a liaison between the two factions?

17 Responses to “Wanted: Virtualization Engineer, Referee Exp. Pref.”

  1. c1tr1xguru

    This is where developing a true strategic plan around the use of virtualization helps. It's not that we need a "virtualization engineer"; you need an architect who understands how to put the right people in the same room for a few days to put all the requirements on the table, build an understanding of the impacts, line out the alternatives (if any), and last, but certainly not least, build a consensus.

    Now who are the players? The Project Sponsor (CIO, Dir of IT, Sr VP of IT), key application owners, the server architecture lead, LAN/WAN architecture leads, the Data Center Mgr, and the security lead. Besides the above-mentioned benefits of putting all these people in the room, it also cuts down the delay in Q&A. If someone has a question, there's no more "we have to ask the LAN guys": they are right there in the room.

    I’ve done many of these facilitated sessions and I can guarantee that when everyone walks out of the room after a few days they all have a vested interest in the final solution.

  2. Carmelo Lisciotto

    All great solutions discussed…
    but when put into practice, there go all the highly touted cost savings that virtualization promised…

    Carmelo Lisciotto

  3. “… if the server receives too much load or has a failure, the XenMotion functionality could move the virtual machine to another server without resulting in any downtime.”

    If a server “has a failure”, how can moving the VM be done without any downtime?

  4. Here are my thoughts as the CTO of Neverfail: This is a great example of the need for automated, intelligent management of virtual machines. While XenMotion and VMotion make it possible to move servers around easily, that doesn't mean those servers will continue to provide the same level of performance or availability when you do. In this example, the XenMotioned servers could fail completely because they no longer have the necessary security to work at the new location in the network. Another possibility is that moving a virtual machine to a server at a different location on the network might cause unacceptable performance. A good example of the latter: moving SQL servers that hold SharePoint documents onto a different network segment from the SharePoint front-end servers would cause serious performance issues, slowing SharePoint down.

    The bottom line is that virtualization has added yet another layer of complexity that makes predicting and protecting systems from fault and performance problems almost impossible, even for the most expert administrators. The solution is to monitor and automate virtual machines not just at the physical machine and network layers, but also taking into account knowledge of the applications and the security configuration of the machines themselves.

  5. @Don – that’s (6000 VMs) about 5x the size of the environment we built out (at least, the size when I moved on), and definitely to the point that it makes sense to take a good look at distributing those services across multiple locations or networks. One of my favorite things about a fully-virtualized environment is that everything becomes a commodity akin to power – we have big logical containers of CPU cycles and RAM and storage, and when we combine those with some high-end load balancers we get a very flexible and easily scalable environment. If you can then define what pieces of equipment you need to build out such an environment, you can turn the datacenter itself into a single logical container, and plunk down one of these containers anywhere that’s got good connectivity to your customers and good pricing. It’s a little like building hierarchically with Legos – when all the pieces are interchangeable and well-defined, the infrastructure becomes consistent, simple to manage, easy to learn and massively scalable.

    (of course, if all your customers are in one location – say, academia or a compute cluster somewhere – distributing load geographically may not make sense, but the concept still applies on the local level, I think.)

  6. @VirtualMan – thanks, I’ll check out Scalent.

    @darkuncle – the folks I met with were using VLANs, but probably not as you describe. I’ll dig into that more with 802.1q trunks on the servers.

    @Don – thanks for the comments and sharing your experiences.

  7. Don Nalezyty

    I have first-hand experience attempting to negotiate this exact battle for about four years.

    We’ve had a stateless Grid for over seven years; it didn’t let us migrate a live instance, but it did let us move a shut-down instance to any server in the Grid extremely rapidly. We use PXE to boot the servers from NFS volumes on NAS. When we only had 240 servers in the Grid, it wasn’t difficult to have them all in a single network space.
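    A PXE-plus-NFS-root setup along those lines is usually driven by a bootloader config file. A hypothetical pxelinux.cfg sketch (the NFS server address and paths are placeholders, not an actual config):

    ```
    # /tftpboot/pxelinux.cfg/default -- hypothetical; server and paths are placeholders
    DEFAULT grid
    LABEL grid
      KERNEL vmlinuz
      APPEND initrd=initrd.img root=/dev/nfs nfsroot=10.0.0.5:/exports/grid-root ro ip=dhcp
    ```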

    Then the Grid continued growing, and we added requirements for more than simple HTTP access from the Internet. Security and Networks began to grow more and more concerned about the environment. About four years ago we realized we needed to go virtual to remain cost-effective and be able to scale as the business continued growing, which finally fired the first shots in the war.

    There are a number of vendors with solutions attempting to resolve this. As mentioned by stockandawe, Cisco VFrame is one. We were lucky enough to be part of a very early beta for this product. While Cisco has done an excellent job of seeing the nature of the problem and attempting to create a solution, it has one major drawback: VFrame needs complete control of the environment to work, and both Security and Networks were unable to overcome cultural and very real concerns about handing that control over.

    Scalent (as noted by virtualman), Xsigo and a few others have taken a similar approach, but all face the same challenges. The cultural issues can’t be ignored. It’s ironic that as technologists we can fall into the same trap we so often accuse non-technologists of: fear of technology because it’s new and untried.

    Many companies are going to be unwilling to be the first to adopt these new technologies because they are not proven by wide acceptance. It’s often up to smaller companies that aren’t afraid to take a risk and are capable of being more agile to adopt these technologies and prove they are reliable.

    I think darkuncle hit the nail on the head when he said everyone needs to be involved from the start. By involving network, security, application and systems architects and engineers from the start, you have the opportunity to support one another through large leaps of faith.

    We haven’t really achieved the holy grail of a fully virtualized environment through all components and infrastructure yet, but we’re getting there. We’ve moved the entire Grid behind SSL-VPN and, much like darkuncle, we’ve used VLANs to gain some flexibility, but that has limits as well. Our Grid has grown significantly and can host up to 6,000 or so VMs, which is like a small datacenter unto itself. Having a single network space encompass so many systems adds complexity and risk.

    Until more of the bigger vendors become more engaged in this space, it’s going to be a challenge finding solutions that make everyone happy.

  8. darkuncle

    As with any environment (virtualized or otherwise), those working on architecture (systems, app, network and security engineers alike) have to be involved together from the get-go. Dropping in any new technology without making sure everyone understands how it interoperates with the existing infrastructure (as well as the risks and the business case) is a recipe for failure.

    In this particular case, the correct approach (well, one correct approach anyway) is to use VLANs, tag all the server VLANs that will be in use for any VMs to all ESX hosts, and firewall based on VLAN, rather than based on physical host. This approach tends to be more flexible and scalable for neteng in addition to the benefits for the systems and application folks (speaking from experience in building out just such an architecture).
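    On the switch side, that might look something like the following. A hypothetical Cisco IOS-style sketch (the interface and VLAN numbers are placeholders, not any specific build-out):

    ```
    ! Trunk every server VLAN to each ESX host uplink (hypothetical VLAN IDs)
    interface GigabitEthernet1/0/1
     description uplink to esx-host-01
     switchport trunk encapsulation dot1q
     switchport mode trunk
     switchport trunk allowed vlan 10,20,30
    ```

    Firewall policy then keys on the VLAN (the logical segment) rather than on which physical host a VM happens to occupy.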

  9. …and following up on stockandawe’s comments, it looks like Scalent Systems has stolen a march on VFrame.

    From the website:

    “Scalent…lets data centers change server software, network connectivity, and storage access in real-time”


    “Scalent code ships in VMware ESX Server”.

    Seems like a network-boot solution already exists to do the first-stage work you’re describing, Allan. (E.g., power on a machine, set up the right network and storage connectivity, and boot the right hypervisor…)

    Scalent also appears to be resold by HP, Unisys, and EMC. I thought I saw Scalent listed in the Nexus press release too (they are in the Google cached version), but then they’re missing on the Cisco site version…?

  10. stockandawe

    I think the issues/concerns that you raise have started to get addressed as virtualization of the server, the network and the storage start working together. For a data center solution, you need to look at solutions for all of these instead of a single server-only or network-only solution. As your applications move, your network and storage have to be configured accordingly, and what you need is management of this, which is addressed by products like Cisco VFrame Data Center.

    Check out this talk on the Cisco Nexus 5000 (which is a network entity) and how it is designed (theoretically) to enable VMware (which is a server-virtualization entity).

    Disclaimer: I am not a systems/network/virtualization engineer, but in a past life did work at Cisco.