Humans have limitations

Untangling the data center from complexity and human oversight

Our investment thesis at Khosla Ventures is that simplicity through abstraction and automation through autonomic behavior will rule in the enterprise’s “New Stack,” a concept that embraces several industry changes:

  • The move to distributed, open-source-centric, web-era stacks and architectures for new applications. New Stack examples include Apache web stacks, NoSQL engines, Hadoop/Spark, etc., deployed on open source infrastructure such as Docker, Linux/KVM/OpenStack and the like.
  • The emergence of DevOps (a role that didn’t even exist 10 years ago) and general “developer velocity” as a priority: e.g., giving developers better control of infrastructure and the ability to rapidly build, deploy and manage services.
  • Cloud-style hardware infrastructure that provides the cost and flexibility advantages of commodity compute pools in both private data centers and public cloud services, giving enterprises the same benefits that Google and Facebook have gained through in-house efforts.

The most profound New Stack efficiency will come from radically streamlining developer and operator interactions with the entire application/infrastructure stack, and embracing new abstractions and automation concepts to hide complexity. The point isn’t to remove the humans from IT — it’s to remove humans from overseeing areas that are beyond human reasoning, and to simplify human interactions with complex systems.

The operation of today’s enterprise data centers is inefficient and unnecessarily complex because we have standardized on manual oversight. For example, in spite of vendors’ promises of automation, most applications and services today are manually placed on specific machines, as human operators reason across the entire infrastructure and address dynamic constraints like failure events, upgrades, traffic surges, resource contention and service levels.

The best practice in data center optimization for the last 10 years has been to take physical machines and carve them into virtual machines. This made sense when servers were big and applications were small and static: virtual machines let us squeeze a lot of applications onto larger machines. But today’s applications have outgrown servers and now run across multitudes of nodes, on-premises or in the cloud. That’s more machines and more partitions for humans to reason about as they manage their growing pool of services. And the automation that enterprises try to script over this environment amounts to a linear acceleration of existing manual processes, adding fragility on top of abstractions that are a poor fit for these new applications and the underlying cloud hardware model. Similarly, typical “cloud orchestration” vendor products increase complexity by layering on more management components that themselves need to be managed, instead of simplifying management.

Embracing the New Stack developers

Server-side developers are no longer writing apps that run on single machines. They are often building apps that span dozens to thousands of machines and run across the entire data center. More and more of the mobile and internet applications built today are decomposed into a suite of “micro-services” connected by APIs. As these applications grow to handle more load and changing functionality, it becomes necessary to constantly re-deploy and scale back-end service instances. Developers are stalled by having these changes go through human operators, who themselves are hampered by a static partitioning model in which each service runs on an isolated group of machines.
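
To make the contrast concrete, here is a minimal sketch in Python of what a declarative, service-level deployment and scale-out operation might look like when the unit of work is the logical service rather than a named machine. The ServiceSpec type and the in-memory Cluster control plane are hypothetical illustrations of the idea, not any vendor’s actual API.

```python
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    """Hypothetical declarative description of one micro-service."""
    name: str
    image: str        # container image to run
    cpus: float       # CPU share per instance
    mem_mb: int       # memory per instance
    instances: int    # desired instance count


class Cluster:
    """Toy control plane: it only tracks desired state; a real scheduler
    would continuously reconcile this against the machine pool."""
    def __init__(self):
        self.desired = {}

    def deploy(self, spec: ServiceSpec):
        self.desired[spec.name] = spec
        print(f"deploy {spec.name}: {spec.instances}x "
              f"({spec.cpus} cpu, {spec.mem_mb} MB) from {spec.image}")

    def scale(self, name: str, instances: int):
        # Scaling is a one-line change to desired state, not a ticket
        # filed with an operator who owns a group of machines.
        self.desired[name].instances = instances
        print(f"scale {name} -> {instances} instances")


cluster = Cluster()
cluster.deploy(ServiceSpec("checkout-api", "shop/checkout:1.4", 0.5, 512, 3))
cluster.scale("checkout-api", 12)   # e.g. in response to a traffic surge
```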

Even in mature DevOps organizations, developers face unnecessary complexity by being forced to think about individual servers and partitions, and by creating bespoke operational support tooling (such as service discovery and coordination) for each app they develop. The upshot is the pain of developer time lost to tooling, provisioning labor, and the hard cost of the underutilization that results from brute-force “service per machine” resource allocation.
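
As an illustration of the kind of plumbing each team ends up rebuilding, here is a minimal, self-contained sketch of a service registry: instances register themselves and heartbeat, and clients look up live endpoints. In practice this is usually backed by a coordination service such as ZooKeeper or etcd; the Registry class below is a hypothetical in-memory stand-in, not a real library.

```python
import time
from collections import defaultdict


class Registry:
    """Toy service registry: the operational support code that gets
    re-implemented per application when the platform provides none."""
    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        # service name -> {endpoint: time of last heartbeat}
        self.entries = defaultdict(dict)

    def register(self, service: str, endpoint: str):
        self.entries[service][endpoint] = time.time()

    def heartbeat(self, service: str, endpoint: str):
        self.entries[service][endpoint] = time.time()

    def discover(self, service: str) -> list[str]:
        # Return only endpoints whose heartbeats have not expired,
        # so callers do not route traffic to dead instances.
        now = time.time()
        return [ep for ep, seen in self.entries[service].items()
                if now - seen < self.ttl]


registry = Registry()
registry.register("inventory", "10.0.1.17:8080")
registry.register("inventory", "10.0.2.44:8080")
print(registry.discover("inventory"))
```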

We believe the simpler path for the New Stack is to give power to developers to write modern data center–scale applications against an aggregation of all of the resources in the data center, to build operational support into apps as they are created, and to avoid management of individual machines and other low-level infrastructure.

Delivering such an abstraction lays the foundation for an increasingly autonomic model where logical applications (composed of all of their physical instances and dependent services) are the first-class citizens deployed and automatically optimized against the underlying cloud-style infrastructure. Contrast this with the typical enterprise focus on the operation of servers as first-class citizens — a backward-looking approach that represents pre-Cloud, pre-DevOps thinking.

Distributed computing isn’t just for Google and Twitter

Turing Award winner Barbara Liskov famously quipped that all advances in programming have relied on new abstractions. That truth is even more pronounced today.

Most enterprises adopting the New Stack today will have a mixed fleet of applications and services with different characteristics: long-running interactive applications, API/integration services, real-time data pipelines, scheduled batch jobs, etc. Each distributed application depends on other services and is made up of many individual instances (sometimes thousands) that run across large numbers of servers. This mixed topology of distributed applications running across many servers is geometrically more complex than an application running on a single server. Each of the services that make up an application needs to operate independently while simultaneously coordinating with all of the interlocking parts so that the application acts as a whole.

In the above model, it’s inefficient to use human reasoning to think about individual tasks on individual servers. You need abstractions and automation that aggregate all of the individual servers into what behaves like one pool of resources, where applications can call upon the computation they need to run (CPU, memory, I/O, storage, networking) without having to think about servers. To achieve an optimal cost structure and utilization, the resource pool is aggregated from low-cost, relatively homogeneous equipment and capacity from multiple vendors and cloud providers, and deployed under a uniform resource allocation and scheduling framework. This strategy avoids costly, specialized equipment and proprietary feature silos that lead to lock-in and, ultimately, less flexibility and manageability at scale.
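
A toy sketch of that aggregation idea, assuming a deliberately simple first-fit placement policy: tasks declare only the resources they need, and a scheduler places them anywhere in the pooled capacity, so neither developers nor operators reason about individual servers. The Node, Task and schedule names are hypothetical; real schedulers add constraints, preemption and fault handling, and this only shows the shape of the abstraction.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    service: str
    cpus: float
    mem_mb: int


def schedule(tasks: list[Task], nodes: list[Node]) -> dict[str, str]:
    """Place each task on the first node with enough free capacity.
    Callers never name a machine; they only describe what a task needs."""
    free = {n.name: [n.cpus, n.mem_mb] for n in nodes}
    placement = {}
    for i, t in enumerate(tasks):
        for name, (cpu, mem) in free.items():
            if cpu >= t.cpus and mem >= t.mem_mb:
                free[name][0] -= t.cpus
                free[name][1] -= t.mem_mb
                placement[f"{t.service}#{i}"] = name
                break
        else:
            placement[f"{t.service}#{i}"] = "UNSCHEDULED"  # pool is full
    return placement


# A mixed fleet sharing one pool: long-running API instances plus a batch job.
pool = [Node("node-1", 8, 32768), Node("node-2", 8, 32768)]
work = [Task("api", 1.0, 2048)] * 6 + [Task("nightly-etl", 4.0, 16384)]
for task, node in schedule(work, pool).items():
    print(task, "->", node)
```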

Google was the first to overcome the limits of human oversight of data center resources with this resource aggregation approach by building its own resource management framework (which was initially called Borg, then evolved into Omega). Twitter rebuilt its entire infrastructure on top of the Apache Mesos distributed systems kernel to kill the “fail whale.” You could argue that Google and Twitter — in the absence of innovation from the big systems players — created their own operating systems for managing applications and resources across their data centers. That simple idea of a data center operating system — although complex to create and execute in the first place — is what drove our most recent investment in Mesosphere.

We believe the adoption of this type of “data center OS” outside of the largest web-scale businesses is an inevitability. Even small mobile applications have outgrown single machines and evolve much more rapidly. Managing change as a process instead of a discrete event has become table stakes for CIOs, and daily changes in business models make data center resource requirements highly unpredictable. Elastic infrastructures are no longer a “nice-to-have.” And human beings and manual oversight have reached their limits.

Vinod Khosla 

3 Responses to “Untangling the data center from complexity and human oversight”

  1. Brian Jones

    It was as long ago as 1989 that I was responsible for the electrical services design of my first Data Centre. I was the Electrical Associate for Aukett, an architectural and building services consultancy based in Chelsea, when we were appointed by Sun Alliance Insurance to design their new Data Centre in Southwater, Surrey.
    Sun Alliance were very keen to have total flexibility and to avoid having any AC/electrical plant within the machine hall (other than dual-supply PDUs), which dictated the need for an undercroft feeding the machine hall above and housing plant for a total electrical design load of 800 kW, including:
    Distribution switchgear
    N+1 (2 no.) 400 Hz frequency converters feeding IBM mainframe computers
    N+1 (3 no.) 50 Hz 500 kVA rotary UPS
    Machine hall mechanical plant, including chilled water units
    Halon bottles
    The remaining electrical plant was housed in an adjacent plant room:
    2 no. 1000 kVA transformer substation
    N+1 (3 no.) 1250 kVA standby diesel generators
    Main intake switchgear
    Lead-acid batteries
    Mechanical plant MCCs, etc.
    So how would the design differ given the same brief today? Certainly there would not be the need for the 400 Hz frequency converters for mainframe computers, and the halon system would have to give way to halocarbon-based agents, but the general layout would still provide a good solution. However, the need for future capacity was underestimated at the time. The electrical design load of 800 kW was meant to cater for five years of growth, but less than a year later we were being appointed to build a new power house to double the capacity, with the need to consider replacing the UPS/batteries/generators with diesel rotary UPS.

  2. dstributed

    The article nicely embraces the future of highly distributed application programming against a DCOS, as well as the operating model that comes along with it, and Mesosphere will play a big role there.