It’s become almost cliche to bemoan the state of infrastructure monitoring and management tools. Cliff Moon, the CTO of Boundary, cuts through the litany of complaints and explains why it’s time for us to think of monitoring in a new way.

OODA image

Today’s operations engineers are faced with choosing between two imperfect routes for infrastructure monitoring. On one hand, there’s no shortage of complicated, inflexible and expensive enterprise tools shrouded in the glory of vendor lock-in. On the other, we have a veritable zoo of open source tools — many of which are great at addressing specific pain points, but are small pieces of a larger puzzle.

The failure of the space has spawned a loosely-organized grassroots movement in the devops community to address the challenge, and led to numerous blog posts, an IRC channel and a collection of GitHub repos.

Although there is a litany of complaints, I would submit that this pain is rooted in a way of monitoring that ignores the realities of growing and scaling businesses along with the demands of fast-paced infrastructure teams. Perhaps it’s time for us to think of monitoring in a new way.

A new (old) model
Drawing on his experience as a fighter pilot in the Korean War, United States Air Force Colonel John Boyd formulated the “OODA loop.” OODA stands for observe, orient, decide and act. Boyd theorized that the faster a team could observe what’s happening, orient themselves to the situation, decide how to respond and act, the quicker and more effective their response would be. His insight suggests that teams iterating through the loop faster gain a competitive advantage over opponents. I’d suggest that any well-designed monitoring tool can help automate the OODA loop for operations teams. Below are the essential components of monitoring infrastructure for fast-paced teams.

1. Deep integration
Most open source monitoring tools only tackle one aspect or a subset of the OODA loop. For instance, Graphite and Cacti provide trending (orientation), Nagios provides alerting (decision and action), and StatsD and collectd gather metrics (observation). But integrating these projects is a daunting task and often takes the form of a Frankenstein’s monster of Perl scripts and PHP dashboards. While each of these tools is helpful, each paints only part of the picture. An ideal tool might integrate all four steps of the OODA loop into one harmonious system. Where necessary, one would also expect API endpoints to allow for custom behavior and flexibility to further automate a team’s action.

2. Contextual alerting and pattern recognition
Most monitoring tools require the user to predefine all of the conditions on which to alert. For instance, one would set static thresholds that say, “Notify me when disk usage goes above 90 percent,” or, “Notify me when CPU usage goes above 75 percent.” However, static thresholds are a poor substitute for pattern recognition, the basis of cognitive decision-making. Setting static thresholds for applications whose load varies throughout the day, week, or month is hell. At any given point, monitoring infrastructure should be able to reflect upon its current state, past state, and forecast and ask, “Do current trends deviate enough to warrant action?” And if so, it should immediately notify the team with context. What if ops teams could look at a graph and say to the system, “Alert us when something looks (or doesn’t look) like this?”
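To make the contrast concrete, here is a minimal Python sketch that puts a static threshold next to a check against a rolling baseline. It is an illustration only, not how any particular tool works: the class name `DeviationDetector`, the window size, the three-standard-deviation cutoff and the sample CPU values are all assumptions chosen for the example, and a simple z-score stands in for real pattern recognition.

```python
from collections import deque
from statistics import mean, stdev


def static_alert(value, limit=75.0):
    """Classic static threshold: fire whenever the value exceeds a fixed limit."""
    return value > limit


class DeviationDetector:
    """Hypothetical detector that compares each sample to a rolling baseline."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # most recent samples only
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # stay quiet until a baseline exists

    def observe(self, value):
        """Return True if value deviates strongly from recent history."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = stdev(self.history)
            if spread > 0:
                anomalous = abs(value - baseline) / spread > self.threshold
            else:
                # Perfectly flat history: any change at all is a deviation.
                anomalous = value != baseline
        # The sample joins the baseline either way, so a sustained shift
        # eventually becomes the new normal.
        self.history.append(value)
        return anomalous


# Assumed overnight CPU load hovering around 20 percent.
detector = DeviationDetector(window=30)
for cpu in [20.0, 21.0, 19.0, 20.5, 19.5] * 4:
    detector.observe(cpu)

print(static_alert(60.0))      # False: 60 percent is below the fixed 75 percent limit
print(detector.observe(60.0))  # True: 60 percent is far outside the overnight baseline
```

The jump to 60 percent never trips the fixed limit, yet it is exactly the kind of change in pattern the paragraph above describes. A production system would also need to account for daily and weekly seasonality, which this sketch ignores.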

3. Timeliness
The term “real time” has been watered down, but it carries a specific meaning. Real-time computing concepts in monitoring systems relate to an intrinsic property of events: they happen on a timeline. Monitoring systems must be real time, because the timeliness of the data impacts its correctness and utility. All aspects of a monitoring system must respond immediately to events. The OODA loop is only effective when it is faster than the environment or opponent that it is running against. If you’re operating on assumptions that are a minute old, it’s hard to say much of anything about what’s happening now.

4. High resolution
The resolution of monitoring systems is critical. With most options offering updates once every one to five minutes, low-resolution monitoring obscures a world of patterns that are invisible until you’ve zoomed in. The difference between a one-second graph updated in real time and a one-minute graph updated every five minutes is the difference between a fluid HD film and a paper flip-book.
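A small Python sketch shows how much a coarse interval can hide. The data is made up for illustration: 60 one-second latency readings with a three-second spike, and a hypothetical `downsample` helper that averages them into a single one-minute point.

```python
def downsample(samples, bucket_size):
    """Average consecutive samples into buckets of bucket_size readings."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        buckets.append(sum(bucket) / len(bucket))
    return buckets


# Assumed data: latency in milliseconds, sampled once per second for a minute.
per_second = [10.0] * 60
per_second[30:33] = [500.0, 500.0, 500.0]  # a three-second spike

per_minute = downsample(per_second, 60)

print(max(per_second))  # 500.0
print(per_minute)       # [34.5]
```

At one-second resolution the spike is unmistakable. Averaged over a minute it becomes a single point at 34.5 ms, which is easy to dismiss as noise.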

5. Dynamic configuration
The fluidity of modern architectures demands monitoring infrastructure that can keep up with the changes that ops teams require. The rise of virtualized infrastructure combined with dynamic configuration management systems means that there may be a great deal of host churn. This churn challenges the concepts of host identity that traditional monitoring tools have built in as fundamental abstractions.

What’s next for monitoring?
The pace of business today requires tools that help teams move rapidly through the OODA loop. Smarter software can push this process forward, offering deep integration across infrastructure, pattern recognition to quickly spot problems, real-time updates at high resolution and automatic adaptation to changing environments. With a set of tools like that, operations teams can respond to incidents, resolve them for good and drive business value while leaving competitors’ heads spinning.

Cliff Moon is the founder and CTO of Boundary, a provider of real-time network and applications monitoring-as-a-service. The views expressed here are personal and do not necessarily reflect those of any company with which he is or has been affiliated.

Special thanks and acknowledgement go out to Coda Hale for his views on monitoring and metrics (read Metrics, Metrics Everywhere).

Image courtesy of Flickr user purpleslog.

  1. An article about monitoring written by the CTO of a new monitoring company… hmmm.

  2. I’ve written about the ideal monitoring system that I would really like to have, for all sorts of reasons: from a DevOps perspective, as a developer and as an analyst using this data.


    1. Have you heard of LogicMonitor? [logicmonitor.com]

  3. I am not an IT guy, but I do practice and teach the concepts of COL John Boyd’s work. I love your reference to OODA loops and how you apply them here. Boyd’s ideas are making inroads into numerous disciplines and making them more effective: research and development, leadership methodologies, law enforcement, security and, here, the IT world. Great for Boyd’s legacy.

  4. Good piece, Cliff. We here at New Relic have a similar belief and are on a similar mission. http://newrelic.com

  5. Monitoring sucks = marketing spin. Disappointed to see GigaOM publishing a marketing piece as a real issue. See Netuitive, Quest Foglight and uptime software as reasonable-cost options. Virtualization dramatically reduces monitoring cost. The real issue is that a current survey reports 24% of organizations don’t do any performance monitoring; there is no excuse for this.

    1. John E. Vincent Sunday, February 12, 2012

      @parkercloud As the guy behind the blog posts/github repos and general bitching, I can assure you there’s nothing marketing related behind it. I find it funny you would mention commercial products out of one side of your mouth whilst lambasting a non-commercial non-vendor specific post out of the other.

      If the tools you mentioned met the needs of the actual people who used them (hint: they don’t), you wouldn’t see such a groundswell of frustration from people who have to use them.

    2. We do performance monitoring. Check it out: logicmonitor.com

  6. Mahendra Kutare Sunday, February 12, 2012

    Monitoring and analysis need to come together in next-generation monitoring systems. That is the point we made in this paper – Monalytics: online monitoring and analytics for managing large scale data centers: http://dl.acm.org/citation.cfm?id=1809073. No software companies are building anything close to this work yet. Large-scale distributed systems, especially datacenters and cloud systems, will need monitoring that can detect, analyze and act on or recover from faults as soon as possible to fend off catastrophic failures. The local and global control loops, which are called the new (old) model here, are among the methods and techniques proposed in our work. There is no current implementation that provides anything close to the features that are designed and implemented as part of the Monalytics system.

    Many of the current implementations, such as Ganglia and Nagios, provide static topologies and passive observation through visualization, which are not going to scale to the dynamism and reconfiguration of cloud systems.

    I would say I agree with the premise of this article, but there is a lot more to it than what is proposed in this article. Sadly, many still feel that this is spin :) I would say wait until the world of datacenter and cloud systems, with its emergent phenomena, results in a complex system that’s hard to understand and analyze.

    Please go through this paper and email me if you have any feedback at – mahendra.kutare@gmail.com

    1. Is there a copy of this paper anywhere online other than behind the ACM paywall?

  7. Have you heard of Splunk?

    Check it out…

    1. Adam Vandenberg Monday, February 13, 2012

      Splunk == Hella expensive vendor lock-in (though it is a reasonable product)

  8. What is that yin/yang picture a picture of?

  9. Stephen Burton Sunday, February 12, 2012

    Sounds very similar to the blog post I published a few weeks ago on the AppDynamics Blog and Sys-Con titled “Why Alerts Suck and Monitoring Solutions must become smarter”. http://www.appdynamics.com/blog/2012/01/23/why-alerts-suck-and-monitoring-solutions-need-to-become-smarter/


    1. JXInsight/OpenCore Monday, February 13, 2012

      Stephen, there is a big difference between the dumb-ass, low-hanging-fruit monitoring you offer and actual active management (hint: management != monitoring)

      Anyway did you publish your diatribe after we published:



  10. In addition to monitoring, which is reactive in nature, enterprises should consider a proactive business analytics approach. The problem with monitoring is that huge performance sinks are hidden from it, simply because they have always been part of the system. A bad index that has been in a banking application since the ’90s is compensated for by excess hardware and by a workflow that provides enough time for the batch operation to complete. No one knows that adding a column to that index can shave 50% off a certain I/O operation, and no monitoring solution will alert you to that.

    The huge amount of information collected by monitors, RDBMS optimizers and in-house solutions is a real haystack with performance improvement gold hiding in it. For DB2 based enterprise applications, our company http://www.InnovizeIT.biz has been able to save millions annually by finding hidden performance sinks that monitoring tools miss.

