
Summary:

Server virtualization was supposed to drive utilization rates up. But utilization is still low, and the solutions to that problem will change the way the data center operates.


Want to know a deep dark secret in the enterprise? Despite the widespread use of server virtualization, hardware resources in the data center are today tremendously underutilized. When I first discovered this, I was surprised. I had naively thought that virtualization solved the server utilization problem.

But after I reached out to contacts in data-center operations at various companies, I learned that I was wrong. One conversation really stuck with me. I’ll paraphrase:

Me: Do you track server and CPU utilization?
Wall Street IT Guru: Yes
Me: So it’s a metric you report on with other infrastructure KPIs?
Wall Street IT Guru: No way, we don’t put it in reports. If people knew how low it really is, we’d all get fired.

While exact figures vary greatly depending on the type of hardware used in a data center, its specific characteristics and the peak-to-average ratio of the workload, low utilization has been widely observed. A few data points from the past five years:

  • A McKinsey study in 2008 pegging data-center utilization at roughly 6 percent.
  • A Gartner report from 2012 putting the industry-wide utilization rate at 12 percent.
  • An Accenture paper sampling a small number of Amazon EC2 machines and finding 7 percent utilization over the course of a week.
  • The charts and quote below from Google, which show three-month average utilization rates for two 20,000-server clusters. The typical cluster on the left spent most of its time running at between 20 and 40 percent of capacity, and the highest-utilization cluster on the right reaches such heights only because it’s doing batch work.
“Such WSCs tend to have relatively low average utilization, spending most of (their) time in the 10–50% CPU utilization range. This activity profile turns out to be a perfect mismatch with the energy efficiency profile of modern servers in that they spend most of their time in the load region where they are most inefficient.” – The Datacenter as a Computer by Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle.

Efficient hardware is only part of the solution

Most discussions on this topic relate to “Energy Proportional Computing.” My firm has made investments in data-center companies that attempt to address the performance-per-watt overhead of running servers in their least-efficient operating ranges. Making servers more energy efficient is a noble goal, but the flip side of this coin is to run the machines you already have installed at higher utilization, in their more efficient power/performance range. Since it’s a fair bet that the numbers achieved by Google are industry-leading, I feel safe concluding that server infrastructure more broadly is still woefully underutilized.
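
To make that mismatch concrete, here is a rough back-of-the-envelope sketch. It is my own illustration with assumed numbers, not figures from the Google report; the only input is the commonly cited observation that an idle server still draws a large fraction of its peak power.

```python
# Back-of-the-envelope model of energy proportionality (illustrative only).
# Assumption: an idle server still draws about 50% of its peak power, a
# commonly cited ballpark figure rather than a measurement from this article.
IDLE_POWER_FRACTION = 0.5
PEAK_WATTS = 300.0  # assumed peak draw of a single server

def power_draw(utilization):
    """Approximate power draw in watts at a given CPU utilization (0.0-1.0)."""
    return PEAK_WATTS * (IDLE_POWER_FRACTION + (1 - IDLE_POWER_FRACTION) * utilization)

def work_per_watt(utilization):
    """Relative useful work delivered per watt consumed."""
    return utilization / power_draw(utilization)

baseline = work_per_watt(0.9)
for u in (0.1, 0.3, 0.5, 0.9):
    print(f"{u:4.0%} utilization: {power_draw(u):5.0f} W drawn, "
          f"{work_per_watt(u) / baseline:4.0%} of the work-per-watt at 90% load")
```

Under those assumptions, a server running at 10 percent utilization delivers only about a fifth of the work-per-watt of the same server at 90 percent, which is exactly the inefficiency the quote above describes.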

I find this fascinating, especially in light of the broader industry move toward what I call Everything-as-a-Service (EaaS). If I’m Amazon or another, similar provider of raw compute or storage services, my aim is to maximize the amount of revenue I can generate for every dollar I spend on servers. Each incremental increase in efficiency is either money that flows straight to the bottom line, or cash that can be re-invested in other projects. It’s essentially the same for internal private-cloud initiatives. In the enterprise there can be organizational impediments that lead to artificially low infrastructure utilization, but these walls are crumbling as well.

At the same time, a number of fundamental changes are taking place in compute infrastructure at the hardware level. There is tremendous interest from folks like Intel and Facebook in new rack-level architectures that allow memory and compute to be scaled independently. The key reason is that in many cases, systems run out of usable memory long before CPU becomes an issue.

To be fair, this is often one reason CPU usage is shockingly low, and measuring CPU alone is far from the whole story. Rack-scale computing, microservers, new levels of the memory hierarchy, low-latency Ethernet fabrics and new instruction sets like ARM in the data center are all coming in the next few years. If the best the industry can do in a relatively homogeneous x86 world is roughly 10-25 percent utilization, think about how much more complex the problem is going to be with a raft of new hardware options to optimize around.
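
As a small illustration of looking beyond CPU, here is a minimal sketch that samples CPU, memory, disk and network together on a single host. It uses the third-party psutil library, which is my choice for the example rather than anything mentioned in the article.

```python
# Minimal sketch: sample CPU, memory, disk and network together, since CPU
# alone understates (or overstates) how busy a server really is.
# Requires the third-party psutil package: pip install psutil
import psutil

def sample(interval=5):
    disk_before = psutil.disk_io_counters()
    net_before = psutil.net_io_counters()
    cpu_pct = psutil.cpu_percent(interval=interval)  # blocks for `interval` seconds
    disk_after = psutil.disk_io_counters()
    net_after = psutil.net_io_counters()

    disk_mb = (disk_after.read_bytes + disk_after.write_bytes
               - disk_before.read_bytes - disk_before.write_bytes) / 1e6
    net_mb = (net_after.bytes_sent + net_after.bytes_recv
              - net_before.bytes_sent - net_before.bytes_recv) / 1e6

    return {
        "cpu_pct": cpu_pct,
        "mem_pct": psutil.virtual_memory().percent,
        "disk_mb_per_s": disk_mb / interval,
        "net_mb_per_s": net_mb / interval,
    }

if __name__ == "__main__":
    print(sample())
```

A box reporting 15 percent CPU can still be effectively full if memory or I/O is the constraint, which is why the rack-scale efforts above focus on scaling those resources independently.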

A post-hypervisor world

While it’s clear that hypervisor-based virtualization brings many benefits to the enterprise, it’s not obvious that high levels of infrastructure utilization are one of them. It’s also possible that we are moving to a post-hypervisor world. The type of container- and operating-system-based virtualization that has long been available in Solaris and FreeBSD is now in Linux. Hat tip to the team at Joyent, which has been preaching this gospel for ages, though it requires the use of Solaris or its open-source descendants: SmartOS, Illumos, OmniOS, etc.

The rapid adoption and popularity of Docker, a platform for building and managing Linux-based containers, is a testament to the interest in alternative approaches. The advantage of this type of model is that it eliminates the additional operating-system instances and the resources they require. It also simplifies operating-system distribution and management. Minimizing the burden of the guest OS, or eliminating it altogether, is the goal of two interesting lightweight operating-system projects, CoreOS and OSv. Tying these new pieces of the puzzle together at data-center scale are new cluster-management frameworks like the Omega project Google recently discussed and Apache Mesos, which is now commercially supported by Mesosphere.
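
To put a rough number on the resources those extra OS instances consume, here is a toy calculation; all of the figures are assumptions I have chosen for illustration, not measurements from the article.

```python
# Toy illustration (all figures are assumptions, not measurements): the RAM
# consumed by guest-OS copies alone when packing 40 workloads onto one host.
WORKLOADS = 40
APP_RAM_GB = 2.0        # assumed footprint of each application
GUEST_OS_RAM_GB = 1.0   # assumed footprint of each full guest OS

vms = WORKLOADS * (APP_RAM_GB + GUEST_OS_RAM_GB)   # hypervisor model: one OS per VM
containers = WORKLOADS * APP_RAM_GB                # container model: shared host kernel

print(f"Hypervisor VMs: {vms:.0f} GB of RAM")          # 120 GB
print(f"Containers:     {containers:.0f} GB of RAM")   # 80 GB
print(f"Memory freed for more workloads: {vms - containers:.0f} GB")
```

Under those assumptions, a third of the memory in the VM case goes to duplicate operating systems, capacity a container-based host could hand back to applications.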

This set of new technologies may lay the foundation for the post-hypervisor era in the data center. As has been the recent trend, the foundational work and initial deployments start inside the web-scale data centers and then move into the open-source community. It’s clear that leading-edge enterprises are next.

It’s a dynamic time in the data center. New hardware and infrastructure-software options coming to market in the next few years are poised to shake up the existing technology stack. Should be an exciting ride.

Alex Benik is a principal at Battery Ventures who invests in enterprise and web infrastructure start-ups. You can find him on Twitter at @abenik.

  1. I wonder if those EC2 “machines” are underutilised at the host level or at the guest-instance level. I would imagine that AWS are quite efficient at getting the most out of their host machines so they can drive down cost. Indeed, it was selling off the spare capacity Amazon had that started the idea of the cloud in the first place.

    Do you think this could be an opportunity for marketplaces like OnApp’s, where providers can sell off their spare capacity so that, as a customer, you can buy instances from many underlying vendors through a single platform?

    Or do you think it’s more an opportunity for infrastructure software vendors like monitoring companies to provide more analytics on this so owners can make more efficient use of their hardware? Do they even want to?

    1. Vinod Shintre Sunday, December 1, 2013

      David, it’s at the guest level. We also offer a granular view by weekday and display which weekdays the systems are being hogged vs. underutilized. Not a head-turner for many, but it might be useful for some to switch off the VMs when not in use.

    2. The short answer is no, they are not underutilised. Look at this article, for example: http://blog.carlmercier.com/2012/01/05/ec2-is-basically-one-big-ripoff/
      When you compare VPS providers such as Amazon and Linode, you realize the differences in performance.

      I think the author should distinguish between the different use cases of server virtualization. For example, the main use in my company is to run a lot of Windows variations for QA; not only do we never underutilise our servers, we buy more powerful ones every 1.5 years.

  2. A challenge we see is that traditional applications are still mapped to (physical, virtual, or cloud) servers by system-admin staff or tools, and there are large variances in this step based on the application’s role and requirements. The container abstraction is powerful because it separates applications from servers, which then allows standardization and automation of system administration across different application types.

    At Nirmata, we are creating a policy driven application run-time layer using Docker, and more importantly redesigning applications as composite cloud services which enable elasticity.

    Rather than measuring individual server utilization, do you think there are better ways to measure and report system efficiency?

  3. Container-based virtualization has long been available in Linux with OpenVZ.

  4. Vinod Shintre Sunday, December 1, 2013

    Yes, on EC2 we see on average up to 15%, as mentioned in this report.

  5. Jonathan Frappier Sunday, December 1, 2013

    Container-based virtualization is an excellent tool, but dedicating an entire article to it simply because CPU utilization is low in other virtualized environments tells only part of the story. CPU utilization is NOT the bottleneck in virtualized environments; it’s storage. Unless your Linux containers are not reading or writing any data, they will face the same limitations as the solutions you are knocking for not fully utilizing a CPU. http://www.virtxpert.com/the-sorry-state-of-server-virtualization-really-gigaom/

  6. And of course, with all the instrumentation available in illumos and FreeBSD, it’s a lot easier than it has ever been to get to the bottom of the performance problems that occur when multitenancy is taken a bit too far. Usually it’s just a client’s app that’s performing poorly, or part of their stack that’s misconfigured, but being able to conclusively prove this is invaluable.

  7. CPU utilization alone is not the right metric to watch. The right combination of metrics includes CPU, memory, network (in and out) and storage.

    But the most important metric to consider (in correlation with the operational metrics above) is the “engagement metric,” such as website hits, number of concurrent users, etc. Virtualization efficiency can’t be defined with operational metrics only; they must have the “business context.”

    Given the different workload types, different policies should be defined per use case. For instance, it’s clear that a slave database will carry low CPU and network utilization (up to the point that the slave turns into the master, due to failure).

    More details regarding public-cloud utilization rates (including all the different types of metrics) can be found in: http://www.slideshare.net/Cloudyn/microsoft-cloud-day-more-bang-for-your-cloud-final

  8. The type of virtualization has very little to do with CPU utilization. Hypervisors or containers really don’t make that much difference. The real problem is the mismatch between CPU performance, which is way overkill, and the available I/O bandwidth. This is the real problem with using Intel CPUs in the data center. It would be better to match the CPU capability to the I/O bandwidth for a given server and its anticipated workload. However, Intel continues to push power-hungry furnaces for this task. The result is that folks like Facebook are building data centers on the Arctic Circle to keep them cool. All this said, my company is building ARM-based servers, so I’m obviously biased, but the arguments I’m making here are based on numbers as much as on my desire to solve what I think is a fundamental hardware mismatch.

  9. Where does this premise come from that servers are supposed to be highly utilized? That is simply not the norm: online systems usually have to be waiting for action, which means they’re underutilised by design much of the time. Yes, it’s possible to interleave some bursty workloads, but only if the workloads can tolerate added latency. I work in research HPC, for instance, where the main model is, in fact, interleaving multiple bursty workloads. Most users are tolerant of the queueing, but it does introduce delays, even as it permits a well-run HPC cluster to stay quite close to 100% busy.

    I see no reason why business servers would be expected to behave this way, since they cannot normally be latency-tolerant. Yes, some benefit can come from resource pooling in a cloud, but even that can’t hide the fact that businesses run mostly during daylight hours, or that commercial activity levels are often correlated across businesses (such as the current Black Friday frenzy!).

  10. I’m not following the ‘post-hypervisor’ comment… You say we could be moving to a post-hypervisor world, but these operating systems are simply a different flavor of hypervisor. The form may look a bit different, but the function is the same. One of your examples, SmartOS, even refers to itself as a type 1 hypervisor.

