The sorry state of server utilization and the impending post-hypervisor era


Want to know a deep dark secret in the enterprise? Despite the widespread use of server virtualization, hardware resources in the data center are today tremendously underutilized. When I first discovered this, I was surprised. I had naively thought that virtualization solved the server utilization problem.

But after I reached out to contacts in data-center operations at various companies, I learned that I was wrong. One conversation really stuck with me. I’ll paraphrase:

Me: Do you track server and CPU utilization?
Wall Street IT Guru: Yes.
Me: So it’s a metric you report on with other infrastructure KPIs?
Wall Street IT Guru: No way, we don’t put it in reports. If people knew how low it really is, we’d all get fired.

While exact figures vary greatly depending on the type of hardware used in a data center, its specific characteristics and the peak-to-average ratio of the workload, low utilization has been widely observed. A few data points from the past five years:

  • A McKinsey study in 2008 pegged data-center utilization at roughly 6 percent.
  • A Gartner report from 2012 put the industry-wide utilization rate at 12 percent.
  • An Accenture paper sampling a small number of Amazon EC2 machines found 7 percent utilization over the course of a week.
  • The charts and quote below from Google show three-month average utilization rates for 20,000-server clusters. The typical cluster on the left spent most of its time running at between 20 and 40 percent of capacity, and the highest-utilization cluster on the right reaches such heights only because it’s doing batch work.
“Such WSCs tend to have relatively low average utilization, spending most of (their) time in the 10–50% CPU utilization range. This activity profile turns out to be a perfect mismatch with the energy efficiency profile of modern servers in that they spend most of their time in the load region where they are most inefficient.” – The Datacenter as a Computer by Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle.


Efficient hardware is only part of the solution

Most discussions on this topic relate to “Energy Proportional Computing.” My firm has made investments in other data-center companies to attempt to address the performance-per-watt overhead of running servers in their least-efficient operating ranges. Making servers more energy efficient is a noble goal, but the flip side of this coin is to run the machines you already have installed at higher utilization in their more efficient power/performance range. Since it’s a fair bet the numbers achieved by Google are industry leading, I feel safe in concluding that server infrastructure more broadly is still woefully underutilized.
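To make the efficiency argument concrete, here is a minimal sketch of why low utilization and poor energy efficiency go hand in hand. It assumes a simple linear power model in which a server draws a fixed fraction of its peak power even when idle. The function names and the 50 percent idle figure are illustrative assumptions, not numbers from the studies cited above.

```python
def power_fraction(utilization, idle_fraction=0.5):
    """Fraction of peak power drawn at a given utilization.

    Assumes a linear model: a fixed idle draw plus a part that
    scales with load. idle_fraction=0.5 is an illustrative guess.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction=0.5):
    """Work done per watt, normalized so that full load equals 1.0."""
    return utilization / power_fraction(utilization, idle_fraction)


if __name__ == "__main__":
    for u in (0.10, 0.30, 0.50, 1.00):
        print(f"utilization {u:4.0%}: "
              f"power {power_fraction(u):4.0%} of peak, "
              f"efficiency {relative_efficiency(u):4.0%} of best case")
```

Under these assumptions, a machine idling along at 10 percent utilization still draws more than half of its peak power, so it delivers less than a fifth of its best-case work per watt. Lowering the idle draw (more energy-proportional hardware) and raising utilization both move the result in the right direction.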

I find this fascinating, especially in light of the broader industry move toward what I call Everything-as-a-Service (EaaS). If I’m Amazon or a similar provider of raw compute or storage services, my aim is to maximize the revenue I earn for every dollar I spend on servers. Each incremental increase in efficiency is either money that flows straight to the bottom line or cash that can be reinvested in other projects. It’s essentially the same for internal private-cloud initiatives. In the enterprise there can be organizational impediments that lead to artificially low infrastructure utilization, but these walls are crumbling as well.

At the same time, a number of fundamental changes are taking place in compute infrastructure at the hardware level. There is tremendous interest from folks like Intel and Facebook in new rack-level architectures that allow memory and compute to scale independently. The key reason is that in many cases, systems run out of usable memory long before CPU becomes an issue.
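A small worked example shows how running out of memory first caps CPU utilization. This is only a sketch: the server size and workload footprint below are made-up numbers chosen for illustration, and the `Server` and `Workload` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    cores: float
    memory_gb: float


@dataclass
class Workload:
    cores: float       # CPU cores each instance keeps busy
    memory_gb: float   # memory each instance needs


def max_instances(server, workload):
    """How many instances fit before either resource is exhausted."""
    by_cpu = int(server.cores // workload.cores)
    by_memory = int(server.memory_gb // workload.memory_gb)
    return min(by_cpu, by_memory)


def cpu_utilization(server, workload):
    """CPU utilization once the server is packed as full as it can be."""
    return max_instances(server, workload) * workload.cores / server.cores


if __name__ == "__main__":
    # Illustrative numbers only: a 16-core, 64 GB box running a
    # memory-hungry service that needs half a core and 8 GB per instance.
    server = Server(cores=16, memory_gb=64)
    workload = Workload(cores=0.5, memory_gb=8)
    print("instances that fit:", max_instances(server, workload))
    print(f"CPU utilization when full: {cpu_utilization(server, workload):.0%}")
```

With these assumed numbers, memory is exhausted after eight instances, leaving the CPUs at 25 percent utilization even though the server is "full." Being able to add memory without adding cores is what lets the remaining CPU capacity be put to work.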

To be fair, memory constraints are one reason CPU usage is often shockingly low, and measuring CPU alone is far from the whole story. Rack-scale computing, microservers, new levels of the memory hierarchy, low-latency Ethernet fabrics and new instruction sets like ARM in the data center are all coming in the next few years. If the best the industry can do in a relatively homogeneous x86 world is roughly 10 to 25 percent utilization, think about how much more complex the problem is going to be with a raft of new hardware options to optimize around.

A post-hypervisor world

While it’s clear that hypervisor-based virtualization brings many benefits to the enterprise, it’s not obvious that high infrastructure utilization is one of them. It’s also possible that we are moving to a post-hypervisor world. The type of container and operating-system-level virtualization that has long been available in Solaris and FreeBSD is now in Linux. Hat tip to the team at Joyent, which has been preaching this gospel for ages, though its approach required the use of Solaris or its open-source descendants (SmartOS, Illumos, OmniOS, etc.).

The rapid adoption and popularity of Docker, a platform for building and managing Linux-based containers, is a testament to the interest in alternative approaches. The advantage of this type of model is that it eliminates the additional operating-system instances and the resources they require. It also simplifies operating-system distribution and management. Minimizing the burden of the guest OS or eliminating it altogether are the goals of two interesting lightweight operating-system projects, CoreOS and OSv. Tying these new pieces of the puzzle together at data-center scale are new cluster-management frameworks, such as the Omega project that Google recently discussed and Apache Mesos, which is now commercially supported by Mesosphere.

This set of new technologies may lay the foundation for the post-hypervisor era in the data center. As has been the recent trend, the foundational work and initial deployments started inside the web-scale data centers and then moved into the open-source community. It’s clear that leading-edge enterprises are next.

It’s a dynamic time in the data center. New hardware and infrastructure software options that are poised to shake up the existing technology stack are coming to market in the next few years. Should be an exciting ride.

Alex Benik is a principal at Battery Ventures who invests in enterprise and web infrastructure start-ups. You can find him on Twitter at @abenik
