The current mantra in Web operations is to track, record and monitor everything. Data is valuable and storage is cheap, so this makes sense. Metrics that measure the right thing are incredibly important in the context of getting the best performance from your application. To that end, there were some amazing talks on monitoring, metrics, lies, damn lies, and statistics at this year’s Velocity Conference. Two of my favorite talks were from John Rauser at Amazon and Kellan Elliott-McCrea from Etsy.

However, I want to propose a new metric: Mean Time to Pretty Chart (MTPC). For full buzzword compliance, let’s say that WebOps + BigData + Information/Graphic Design = MTPC. If you’re not familiar with acronyms like “MTTR” and “MTBF,” metrics abbreviated with “MT,” meaning “Mean Time,” are common throughout operations-centric businesses of all kinds; MTTR (Mean Time To Recover) and MTBF (Mean Time Between Failures) are two of the most frequently used.

MTPC attempts to quantify the amount of time required to determine the root cause of an operational issue and depict it in an eye-catching way. MTPC is a hard metric to drive down because it spans several distinct problems: acquiring and storing large volumes of data, correlating it, and designing a representation that communicates clearly.
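To make the definition concrete, here is a minimal sketch of how MTPC might be computed from incident records; the field names and timestamps are hypothetical, not taken from any particular monitoring product.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when an operational issue was detected and
# when the team first produced a chart that made the root cause visible.
incidents = [
    {"detected": datetime(2011, 9, 1, 14, 2), "charted": datetime(2011, 9, 1, 14, 47)},
    {"detected": datetime(2011, 9, 8, 3, 15), "charted": datetime(2011, 9, 8, 5, 10)},
    {"detected": datetime(2011, 9, 20, 11, 30), "charted": datetime(2011, 9, 20, 11, 58)},
]

# MTPC: mean elapsed time from detection to "pretty chart," in minutes.
mtpc_minutes = mean(
    (i["charted"] - i["detected"]).total_seconds() / 60 for i in incidents
)
print(f"MTPC: {mtpc_minutes:.0f} minutes")
```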

The metric also has a corollary as a business metric for companies selling performance monitoring products. From the time your product is downloaded, your MTPC should be as short as possible. We have seen many companies over the years that could do incredible analysis but were too intrusive to install to get broadly deployed. In cases like this, the MTPC was too high and the business suffered as a result. Today’s trend is toward downloadable free trials and easy-to-install products focused on keeping MTPC as short as possible.

What is often overlooked by people who don’t operate large-scale websites for a living is just how many layers of the stack underlie a modern website, each of which someone needs to monitor. Let’s look at a very complex example. The chart below, courtesy of Adrian Cockcroft at Netflix, depicts the layers of the stack the streaming site monitors and the numerous data sources it ingests.

The Netflix stack is based on Java and runs on AWS, so applications developed in other languages, such as Ruby, PHP or Python, will use different tools for the application layers. The chart above is not meant to be an exhaustive list of monitoring tools and techniques, nor a comment on the merits of open source tools vs. commercial products; WebOps pros often use a number of both types of products to do their jobs. A highly incomplete list of relevant commercial and open source tools would include Ganglia, Nagios, Cacti, Graphite, Munin, Splunk, New Relic, Tracelytics (see disclosure), DataDog (see disclosure), and AppDynamics.

Displaying such large amounts of information about so many physical and virtual entities is potentially more challenging than collecting and storing the data in the first place. Numerous options exist for visualizations: simple time series graphs, stack charts, heat maps, and more complex choices, like 3D, which leverage the graphics horsepower in modern GPUs. Let your Minority Report UI fantasies run wild.
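As a small, hedged illustration of one of the simpler options, the sketch below renders a latency heat map with matplotlib; the hosts and numbers are synthetic, generated purely for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: p95 latency (ms) for 8 hosts over 24 hours.
rng = np.random.default_rng(0)
latency = rng.normal(loc=120, scale=15, size=(8, 24))
latency[3, 14:18] += 200  # simulate one host misbehaving mid-afternoon

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(latency, aspect="auto", cmap="viridis")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Host")
ax.set_title("p95 latency by host and hour")
fig.colorbar(im, label="ms")
fig.tight_layout()
plt.show()
```

A heat map like this makes a single misbehaving host jump out in a way a wall of per-host line graphs rarely does.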

However, the state of the art in WebOps visualizations appears to have reached a plateau with RRDTool graphs and Skitch.app annotations. We haven’t made much progress in the past few years. This important capability is basically relegated to a “Lolcat macro.” Don’t believe me? Check out WebObsViz on Flickr and see for yourself.

One of the most powerful graphic representations is to look at data from multiple layers in the stack co-mingled in a coherent fashion. A simple example could be CPU and memory utilization overlaid with the timing of pushing new code into production. Looking at data across layers can be very powerful because it allows one to see how changes at one layer in the stack ripple through the infrastructure. Some leading edge organizations are dumping machine data into packages like SAS, SPSS, and R to do correlations and other more advanced statistical analysis. Enter the Data Scientist. While correlation doesn’t imply causation, with large enough sample sizes the old adage “where there is smoke there is usually fire” often applies. When you can visualize that smoke in a pretty chart, it’s easier to pinpoint the fire.
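As a rough sketch of what co-mingled, cross-layer data can look like in practice, the example below overlays two hypothetical metric streams, marks a code push, and computes a simple correlation with pandas; all of the data and the deploy time are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-minute samples from two layers of the stack, plus a deploy
# time pulled from an imaginary deploy log. All values are made up.
idx = pd.date_range("2011-10-03 09:00", periods=120, freq="1min")
metrics = pd.DataFrame({
    "cpu_pct": [40 + 0.3 * i for i in range(120)],          # host CPU climbing
    "p95_latency_ms": [110 + 1.1 * i for i in range(120)],  # app latency climbing
}, index=idx)
deploy_time = pd.Timestamp("2011-10-03 09:45")

# Correlation across layers -- suggestive, not causal.
print(metrics["cpu_pct"].corr(metrics["p95_latency_ms"]))

# Overlay both series and mark the code push so the "smoke" is visible.
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(metrics.index, metrics["cpu_pct"], label="cpu_pct")
ax.set_ylabel("CPU %")
ax2 = ax.twinx()
ax2.plot(metrics.index, metrics["p95_latency_ms"], color="tab:orange", label="p95_latency_ms")
ax2.set_ylabel("p95 latency (ms)")
ax.axvline(deploy_time, color="red", linestyle="--", label="deploy")
ax.legend(loc="upper left")
fig.tight_layout()
plt.show()
```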

Further design complexity also stems from having to provide different views into data for different constituencies within an organization. Developers may want to see a certain view into application errors or how multiple services in their distributed system interact with each other. Operations may want lower level machine data, network statistics, or performance metrics from a cloud provider.

Visualizing operations information is further complicated by the challenge of finding an Edward Tufte disciple who can do information design, build a user interface, and code a bit. The technically competent designer is also one of the hardest roles to fill at our portfolio companies these days. Many companies I speak to have given up searching for this unicorn and have broken design and front-end implementation out into multiple roles.

The future may be brighter in this regard, as the next generation of art and design graduates are digital natives. The work of RISD President John Maeda sits at the nexus of graphic design, computer science, and the fine arts. His disciples, including Ben Fry, Casey Reas, and RISD alum Nicholas Felton, are pioneering new ways to make sense of large sets of data. Not surprisingly, Stanford is also on the cutting edge from the computer science perspective, with Pat Hanrahan’s Visualization Group leading the way.

Alex Benik is a principal at Battery Ventures. Battery Ventures is an investor in DataDog and Tracelytics.

Image courtesy of Flickr user jspaw.
