Why monitoring sucks — for now


Today’s operations engineers are faced with choosing between two imperfect routes for infrastructure monitoring. On one hand, there’s no shortage of complicated, inflexible and expensive enterprise tools shrouded in the glory of vendor lock-in. On the other, we have a veritable zoo of open source tools — many of which are great at addressing specific pain points, but are small pieces of a larger puzzle.

The failure of the space has spawned a loosely-organized grassroots movement in the devops community to address the challenge, and led to numerous blog posts, an IRC channel and a collection of GitHub repos.

Although there is a litany of complaints, I would submit that this pain is rooted in a way of monitoring that ignores the realities of growing and scaling businesses along with the demands of fast-paced infrastructure teams. Perhaps it’s time for us to think of monitoring in a new way.

A new (old) model
Drawing on his experience as a fighter pilot in the Korean War, United States Air Force Colonel John Boyd formulated the “OODA loop.” OODA stands for observe, orient, decide and act. Boyd theorized that the faster a team could observe what’s happening, orient itself to the situation, decide how to respond, and act, the greater its advantage. Teams that iterate through the loop faster than their opponents gain a competitive edge. I’d suggest that any well-designed monitoring tool can help automate the OODA loop for operations teams. Below are the essential components of monitoring infrastructure for fast-paced teams.

1. Deep integration
Most open source monitoring tools only tackle one aspect or a subset of the OODA loop. For instance: Graphite and Cacti provide trending (orientation), Nagios provides alerting (decision and action) and StatsD and collectd gather metrics (observation). But integrating these projects is a daunting task and often takes the form of a Frankenstein’s monster of Perl scripts and PHP dashboards. While each of these tools is helpful, each only paints part of the picture. An ideal tool might integrate all four steps of the OODA loop into one harmonious system. Where necessary, one would also expect API endpoints to allow for custom behavior and flexibility to further automate a team’s action.

2. Contextual alerting and pattern recognition
Most monitoring tools require the user to predefine all of the conditions on which to alert. For instance, one would set static thresholds that say, “Notify me when disk usage goes above 90 percent,” or, “Notify me when CPU usage goes above 75 percent.” However, static thresholds are a poor substitute for pattern recognition, the basis of cognitive decision-making. Setting static thresholds for applications whose load varies throughout the day, week, or month is hell. At any given point, monitoring infrastructure should be able to reflect upon its current state, past state, and forecast and ask, “Are current trends deviant enough to warrant action?” And if so, it should immediately notify the team with context. What if ops teams could look at a graph and say to the system, “Alert us when something looks (or doesn’t look) like this?”
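
To make the contrast concrete, here is a minimal sketch in Python of one simple form of deviation-based alerting: comparing each new sample against a rolling baseline instead of a fixed number. The class name, window size and z-score limit are all illustrative assumptions, not taken from any particular tool, and a rolling z-score is only one crude stand-in for the richer pattern recognition described above.

```python
from collections import deque
from statistics import mean, stdev


def static_alert(value, threshold=75.0):
    """Classic static threshold: fire only above a fixed number."""
    return value > threshold


class DeviationDetector:
    """Hypothetical detector: flags samples far from a rolling baseline."""

    def __init__(self, window=60, z_limit=3.0, min_history=10):
        self.history = deque(maxlen=window)  # recent samples only
        self.z_limit = z_limit               # how many std devs is "deviant"
        self.min_history = min_history       # samples needed before judging

    def observe(self, value):
        """Record a sample; return an alert with context, or None."""
        alert = None
        if len(self.history) >= self.min_history:
            baseline = mean(self.history)
            spread = stdev(self.history)
            if spread > 0:
                z_score = (value - baseline) / spread
                if abs(z_score) > self.z_limit:
                    # Context travels with the alert: what was expected,
                    # what was seen, and how unusual it is.
                    alert = {
                        "value": value,
                        "baseline": baseline,
                        "z_score": z_score,
                    }
        self.history.append(value)
        return alert


# Made-up CPU readings (percent): steady around 21, then a jump to 60.
detector = DeviationDetector(window=30)
quiet_alerts = [detector.observe(v) for v in [20.0, 21.0, 22.0, 21.0] * 5]
spike_alert = detector.observe(60.0)
```

In this made-up series the jump to 60 percent never crosses the 75 percent static threshold, yet it is far outside the recent baseline, so the detector returns an alert that carries the baseline and deviation along with the value.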

3. Timeliness
The term “real time” has been watered down, but it carries a specific meaning. Real-time computing concepts in monitoring systems relate to an intrinsic property of events: they happen on a timeline. Monitoring systems must be real time, because the timeliness of the data impacts its correctness and utility. All aspects of a monitoring system must respond immediately to events. The OODA loop is only effective when it is faster than the environment or opponent that it is running against. If you’re operating on assumptions that are a minute old, it’s hard to say much of anything about what’s happening now.

4. High resolution
The resolution of monitoring systems is critical. With most options offering updates once every one to five minutes, low-resolution monitoring obscures a world of patterns that are invisible until you’ve zoomed in. The difference between a one-second graph updated in real time and a one-minute graph updated every five minutes is the difference between a fluid HD film and a paper flip-book.
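
A small Python sketch shows how much a coarse interval can hide. The numbers are invented for illustration: two minutes of one-second latency samples with a five-second stall, averaged into one-minute buckets the way a low-resolution collector might. The function name `downsample` is hypothetical.

```python
from statistics import mean


def downsample(samples, bucket_size):
    """Average consecutive samples into buckets of bucket_size."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Invented data: 120 one-second latency samples (ms), normally 10 ms,
# with a five-second stall at 500 ms early in the second minute.
per_second = [10.0] * 120
per_second[70:75] = [500.0] * 5

per_minute = downsample(per_second, 60)

worst_second = max(per_second)   # the stall is obvious at 1 s resolution
worst_minute = max(per_minute)   # the same stall, averaged over a minute
```

At one-second resolution the stall shows up as a 500 ms peak. In the one-minute view it is smeared into an average of roughly 51 ms, a bump that is easy to overlook or dismiss.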

5. Dynamic configuration
The fluidity of modern architectures demands monitoring infrastructure that can keep up with the changes that ops teams require. The rise of virtualized infrastructure combined with dynamic configuration management systems means that there may be a great deal of host churn. This churn challenges the concepts of host identity that traditional monitoring tools have built in as fundamental abstractions.

What’s next for monitoring?
The pace of business today requires tools that help teams move rapidly through the OODA loop. Smarter software can push this process forward, offering deep integration across infrastructure, pattern recognition to quickly spot problems, real-time updates at high resolution and automatic adaptation to changing environments. With a set of tools like that, operations teams can respond to incidents, resolve them for good and drive business value while leaving competitors’ heads spinning.

Cliff Moon is the founder and CTO of Boundary, a provider of real-time network and applications monitoring-as-a-service. The views expressed here are personal and do not necessarily reflect those of any Company with which he is or has been affiliated.

Special thanks and acknowledgement go out to Coda Hale for his views on monitoring and metrics (read Metrics, Metrics Everywhere).

Image courtesy of Flickr user purpleslog.


Eric Anderson

Regarding real-time and high resolution:
Totally agree. High resolution real-time analytics and information went from ‘cool tech’ to ‘must have’ in the course of a year, and any monitoring system that does not do a high sampling rate (I’m talking seconds, not minutes) is way behind the curve.

This article was written by the CTO of Boundary (network traffic monitoring) – they understand the importance of high resolution, and I agree with them. I’m biased though too – I’m the Co-Founder of CopperEgg, and we also do high resolution real-time monitoring (for systems/cloud).

Sure, the ‘monitoring sucks’ tagline is marketing speak, but it’s true.


Lester Rivera

Improving monitoring tools by increasing the scalability and speed of collection, or improving the display of high volumes of data, has its place. As does using new technologies and platforms to support new types of applications and architectures. But, with apologies, I don’t believe that any of it represents an evolution in monitoring.

The problem that requires a significant evolution in monitoring tools, as I see it, has less to do with timeliness of data, or even how it is displayed, than with empowerment of the Ops teams. The Ops team tends to be the folks with the least flexibility and decision-making power to respond to any complicated event on the systems they are monitoring; typically, that requires bringing to the table subject matter experts and some sort of decision maker. Millisecond resolution on events means little when it takes 15-30 minutes to alert, awake, and get eyes on the screen of the DBA, Developer, Security Analyst and/or some type of Manager. If you’ll pardon the silly euphemism… Rome is burning while the Ops team watches.

A typical Ops team member may not have the proper expertise to diagnose any type of complex situation, and has limited options to respond to a situation. I believe that any real evolution in monitoring increases the Ops team’s capability to understand and diagnose the evolving situation and expands their options to respond quickly and with the full confidence of decision-makers; that is, the monitoring solution provides an Ops team improved situational awareness combined with expert knowledge and dynamic response options. Ok, that last part was as big a mouthful of marketese as I’ve ever heard.

How I see this working in practice is that log events raise alerts based on pre-programmed expertise packaged by very smart subject matter experts on the components that comprise the application architecture; for example, Apache, Tomcat, MySQL, Ubuntu Unix, Amazon AWS, etc. Combine the alerts with up-to-date man pages or technet-style information packaged to allow the Ops team to understand what the alert is really telling you; that is, something more than a tweet and less than a Wikipedia page, although a link or two to appropriate web resources won’t hurt. And finally provide a set of suitable response options, whether it’s simply some suggested next steps or actual pre-configured scripts, commands, and actions.

Monitoring tools like this exist today, but are hardly commonplace. I helped architect a software solution to do this over 10 years ago. The lesson learned was that while the technology to add these functions isn’t terribly difficult, the pre-packaged knowledge to place inside this system is difficult and expensive; that is, it always has a freshness date and requires a community of expertise to improve and keep it up to date.

That said, all these barriers can be overcome. I might need to tackle it myself again.

Holger Schulze

The conversation mirrors the findings of eG Innovations’ performance monitoring survey. You can download the full report “Performance Management in Transformational IT Environments” here: http://www.eginnovations.com/survey2012


• IT operations professionals struggle to manually diagnose root cause problems in heterogeneous service environments – 54% of IT professionals surveyed said their single biggest performance management challenge is identifying the root cause of slow application performance in such heterogeneous environments. IT operations staff can’t tell where the root of the problem is when a user says performance is “slow”: Is it the Network? Server? Database? Virtual Machine? Today’s manual approach to troubleshooting performance issues is woefully inadequate in dealing with the complexity of transformational IT environments. In response, a majority of 67% would like to see performance management solutions that help them automate and accelerate the diagnosis and resolution of performance issues in heterogeneous service environments.

• Managing virtual and physical machines is a challenge – A very high percentage of respondents (53%) are looking for solutions that can manage performance of both virtual and physical machines, as well as the applications running on them, reflecting the new dynamics and inter-dependencies introduced by virtualization layers of IT environments.

• IT wants integrated performance monitoring across all infrastructure tiers – 60% of respondents say they are using multiple, independent performance monitoring tools today – one for the network, another for servers, another for the database and one for the virtualization tier. This fragmented silo approach does not work anymore, and an overwhelming 66% say they are looking for comprehensive solutions that provide answers to cross-layer performance issues.

• Poor service performance impacts customer satisfaction – The #1 business impact of poor IT service performance is decreased customer satisfaction say 53% of respondents. Decreased user productivity (44%) and an increase of IT operations cost (44%) as a result of IT performance issues are of major concern among IT operations professionals.

• IT operations professionals are looking for proactive problem solving – 63% of respondents are looking for performance management solutions that allow them to become more proactive and find problems before they impact the user experience.

For the complete survey results, visit http://www.eginnovations.com/survey2012

Holger Schulze
eG Innovations


Thinking a little bit bigger: no one wants a tool that “finds problems before they impact the user”; they want applications and infrastructure that self-regulate and adapt (profiling, protecting, policing, predicting, provisioning…). What’s the point of telling an ops person about a predicted problem before it happens, when the system also predicts that it will no longer be a problem by the time the ops person attempts to act, on a timescale that computing considers years?

Aaron Rudger

It’s clear the ops and dev communities need better tools. Still too many groups marshaling disparate charts in “situation room” conf calls when something bad happens. But vendors *are* getting better at #1–e.g. web API integration–to make it easier to parse from end-user perf down to infrastructure.

Christopher Haire

What’s wrong with the model of open source solutions tackling one aspect of the problem? Usually tools created to do the job of many in a one-size-fits-all manner end up failing to equal the right mix of tools that each do one job well. Using a mix of solutions for your various pieces also allows for replacing those solutions as better pieces come along.

Gil Shalit

In addition to monitoring, which is reactive in nature, enterprises should consider the proactive business analytics approach. The problem with monitoring is that huge performance sinks are hidden from it, simply because they have always been part of the system. A bad index that has been in a banking application since the 90’s is compensated for by excess hardware and by a workflow that provides enough time for the batch operation to complete. No one knows that adding a column to that index can shave 50% off a certain I/O operation, and no monitoring solution will alert you to that.

The huge amount of information collected by monitors, RDBMS optimizers and in-house solutions is a real haystack with performance improvement gold hiding in it. For DB2 based enterprise applications, our company http://www.InnovizeIT.biz has been able to save millions annually by finding hidden performance sinks that monitoring tools miss.

Adam Vandenberg

Splunk == Hella expensive vendor lock-in (though it is a reasonable product)

Mahendra Kutare

Monitoring and analysis need to come together in next-generation monitoring systems. That is the point we made in this paper, “Monalytics: online monitoring and analytics for managing large scale data centers.”
http://dl.acm.org/citation.cfm?id=1809073. There are no software companies building anything close to this work yet. Large-scale distributed systems, especially datacenters and cloud systems, will need monitoring that can detect, analyze and act on or recover from faults as soon as possible to fend off catastrophic failures. The local and global control loops, called the new (old) model here, are the methods and techniques proposed in our work. There is no current implementation that provides anything close to the features designed and implemented as part of the Monalytics system.

Many of the current implementations, such as Ganglia and Nagios, provide static topologies and passive observation through visualization, which are not going to scale to the dynamism and reconfiguration of cloud systems.

I would say I agree with the premise of this article, but there is a lot more than what I proposed in this article. Sadly, many still feel that this is spin :) I would say wait till the world of datacenter and cloud systems, with its emergent phenomena, results in a complex system that’s hard to understand and analyze.

Please go through this paper and email me if you have any feedback at – mahendra.kutare@gmail.com


Is there a copy of this paper anywhere online other than behind the ACM paywall?


Monitoring sucks = marketing spin; disappointed to see GigaOM publishing a marketing piece as a real issue. See Netuitive, Quest Foglight and uptime software as reasonable-cost options. Virtualization dramatically reduces monitoring cost. The real issue is that a current survey reports 24% of organizations don’t do any performance monitoring, and there is no excuse for this.

John E. Vincent

@parkercloud As the guy behind the blog posts/github repos and general bitching, I can assure you there’s nothing marketing related behind it. I find it funny you would mention commercial products out of one side of your mouth whilst lambasting a non-commercial non-vendor specific post out of the other.

If the tools you mentioned met the needs of the actual people who used them (hint: they don’t), you wouldn’t see such a groundswell of frustration from people who have to use them.

Fred Leland

I am not an IT guy, but I do practice and teach the concepts of COL John Boyd’s work. I love your reference to OODA Loops and how you apply them here. Boyd’s ideas are making inroads and making numerous disciplines more effective, such as research and development, leadership methodologies, law enforcement, security and, here, the IT world. Great for Boyd’s legacy.


An article about monitoring written by the CTO of a new monitoring company… hmmm.
