
Key Criteria for Evaluating AIOps Solutions v2.0

An Evaluation Guide for Technology Decision Makers

Summary

Cloud computing and the accelerating pace of development, particularly in response to effects from the global pandemic, have been key drivers behind the growth and adoption of AIOps concepts and tooling. As cloud and edge computing have driven operational complexity inexorably upward, IT organizations have turned to automation to address the needs of operations teams.

In fact, without AIOps solutions coming online to bear the load, IT budgets would have been overwhelmed by increases in staff spending, as Figure 1 makes very clear.

Figure 1. Projected IT Staff Spending Based on Available Operations Solutions

Until recently, most traditional IT operations tools offered limited automation and autonomy, with vendors bolting an AI engine onto an existing ITOps tool and calling it AIOps regardless of whether the tool leveraged AI systemically. However, some AIOps tool startups have released purpose-built solutions that leverage AI from the ground up. Over time, we’ve seen incumbent ITOps solutions combine with focused AIOps tools to provide more comprehensive automation and tooling. Larger cloud providers are also starting to enter this active market.

There’s good reason for the excitement. Intelligent AIOps solutions can go beyond the rote triggers and preprogrammed actions of traditional ITOps tools to enable informed, experience-based responses drawn from current events and historical patterns. The objective: a fully automated system that’s able to spot and correct issues without the knowledge or participation of humans. This can sharply reduce mean time to repair (MTTR) for incidents even as it reduces IT headcount.

Every vendor has its own spin on what AIOps is, or should be. But as the stakes rise in a fast-growing market, a set of common patterns, as well as unique capabilities, appears to be emerging.

AIOps Key Findings

As we carried forward with our briefings, looking at the array of AIOps tools in detail, we came to a few core conclusions:

First, AIOps tools today are in a state of rapid evolution, with the tool providers having different takes on what AIOps tools should do, how they are applied, and the value they are likely to bring. For example, some are oriented toward data gathering, with excellent connectors and data acquisition systems, but they may be weak at pattern finding and analysis. Others may be excellent at discovering correlations in the data but stop short of revealing causation, which would be more valuable.

Translating this into real-world IT, it’s like saying the app is slow because the database is slow, when the real reason everything is slow is an oversubscribed network path that’s causing lots of retransmissions. The better tools we looked at would have found the network path issue, while lesser tools might have suggested increasing the size of the server hosting the database.
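The difference between reporting a symptom and finding a cause can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dependency map, the component names, and the likely_root_causes function are hypothetical and do not describe how any particular AIOps product works. The idea is to follow dependencies downward from the slow application and report only the unhealthy components that have no unhealthy dependency of their own.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "app": ["database"],
    "database": ["network_path"],
    "network_path": [],
}


def likely_root_causes(
    start: str, unhealthy: Set[str], depends_on: Dict[str, List[str]]
) -> List[str]:
    """Return unhealthy components reachable from `start` that have
    no unhealthy dependency of their own."""
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deps = depends_on.get(node, [])
        # A component is a candidate root cause only if nothing it
        # depends on is also unhealthy.
        if node in unhealthy and not any(d in unhealthy for d in deps):
            roots.append(node)
        stack.extend(deps)
    return sorted(roots)


if __name__ == "__main__":
    degraded = {"app", "database", "network_path"}
    print(likely_root_causes("app", degraded, DEPENDS_ON))
```

With all three components degraded, the sketch points at the network path rather than the database, which mirrors the scenario described above. Real tools have to infer the dependency map and the health signals themselves, which is where most of the difficulty lies.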

In addition, most tools today don’t provide native self-healing orchestration engines, instead opting for third-party orchestration providers leveraging their APIs. That’s good for enterprises with their own orchestration teams, as an AIOps tool without automation reduces the risk of configuration sprawl. For companies without centralized orchestration tools, however, an AIOps tool with orchestration ability reduces tool sprawl.

Second, there seem to be two types of AIOps players: those that build upon existing operations tools and those that are new to the AIOps space. Each has tradeoffs. New start-ups have the ability to create an AIOps tool purpose-built for modern system issues, such as hybrid and multi-cloud management. However, these start-ups may be missing connections and analytic capabilities critical to more traditional systems. In contrast, traditional tools often bolt an AI and analytic engine onto their mature offerings, targeting modern systems such as public clouds to stay relevant in the market. These incumbent providers also sometimes purchase companies that better support CloudOps to get into the game, and even then the integration can be awkward.

Both approaches are valid and your requirements will determine which way to go. In some cases, you may need more than a single tool to satisfy all your operational requirements. As time goes on, the functions of all these tools will likely normalize, and the features and values will converge.

Third, this year’s report features more tools with self-healing capabilities—an area that was a point of weakness for many tools in last year’s report. While a few support self-healing functions with native process management (such as rebooting a network device or restarting a database), many leverage open APIs to integrate with third-party process orchestration tools. This means that enterprises will have more flexibility in the brands of process orchestration tools they can leverage, a desirable attribute in the current market. This will likely drive independent software vendors or internal development groups that did not create management APIs for their applications to do so, in order to remain relevant to organizations using AIOps tools. In large enterprises with orchestration tools, the focus is on AIOps tools that can intelligently call the orchestration tool to automate remediation.

To address the market for built-in orchestration, AIOps vendors will need to invest in their solution’s ability to make changes directly to systems or software that the AIOps tool did not itself deploy. Smaller companies will have to maintain not only the original processes that support continuous deployment but also the settings in the AIOps tools that automate remediation. Vendors that offer orchestration tools connected to continuous deployment workflows will increase their value by adding AIOps, but very large organizations may have more than one CI/CD tool. Moreover, the CI/CD tool and the AIOps tool might be purchased by different buyers. In any case, an orchestration-agnostic AIOps tool would have more value.
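One way to picture an orchestration-agnostic design is a thin interface that any orchestration backend can satisfy. The Python sketch below is an assumption-laden illustration, not a vendor API: Orchestrator, RecordingOrchestrator, PLAYBOOK, and remediate are all hypothetical names, and the stand-in backend only records requests instead of calling a real tool. The two actions come from the examples given earlier (rebooting a network device, restarting a database).

```python
from typing import Dict, List, Protocol


class Orchestrator(Protocol):
    """Anything that can carry out a named action on a target."""

    def run(self, action: str, target: str) -> str: ...


class RecordingOrchestrator:
    """Stand-in backend that records requests instead of calling a real tool."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.calls: List[str] = []

    def run(self, action: str, target: str) -> str:
        self.calls.append(f"{action}:{target}")
        return f"{self.name} accepted {action} on {target}"


# Hypothetical mapping from a diagnosed issue to a remediation action.
PLAYBOOK: Dict[str, str] = {
    "database_unresponsive": "restart_database",
    "network_device_hung": "reboot_network_device",
}


def remediate(issue: str, target: str, orchestrator: Orchestrator) -> str:
    """Hand a known issue to whichever orchestrator is configured."""
    action = PLAYBOOK.get(issue)
    if action is None:
        # Unknown issues are escalated rather than guessed at.
        return f"no automated action for {issue}; escalate to a human"
    return orchestrator.run(action, target)


if __name__ == "__main__":
    backend = RecordingOrchestrator("tool-a")
    print(remediate("database_unresponsive", "db01", backend))
```

Because remediate depends only on the interface, swapping one orchestration tool for another means supplying a different backend object, with no change to the diagnosis or playbook logic.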

Evaluating an AIOps Tool

When looking at these tools, there are six capabilities you’ll want to make sure your selection includes:

  • Data gathering and selection
  • Finding patterns in the data
  • Data analysis
  • Support for operations teams’ collaboration
  • ITSM integration, such as opening and closing tickets
  • Automated response generation

Tools integration, which is now promoted as a core component of AIOps, may well mean dealing with operations data at heightened security and governance levels. Thus, when picking tools, the focus should be on how the data is gathered, such as an agent or agentless approach, and most importantly, how patterns in the data are determined. For example, you might see a lot of packet failures based on 200 pieces of data out of several million. The tool should help you determine which correlations have meaning within the selected data, analyze the likely root cause issue, and enable collaboration with automated and non-automated processes (humans) to resolve the problem.
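To make the packet-failure example concrete, here is a minimal Python sketch of one way a small number of failures can still carry meaning: checking whether they are concentrated on one path. Everything in it is an assumption for illustration, including the path labels, the concentrated_failures function, and the default factor of 10; production tools use far richer statistics.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def concentrated_failures(
    failures: Iterable[str],
    totals: Dict[str, int],
    factor: float = 10.0,
) -> List[Tuple[str, float]]:
    """Flag paths whose failure rate exceeds `factor` times the overall rate.

    failures: one path label per failed packet
    totals:   packets observed per path
    """
    fail_counts = Counter(failures)
    total_packets = sum(totals.values())
    total_failures = sum(fail_counts.values())
    if total_packets == 0 or total_failures == 0:
        return []
    overall_rate = total_failures / total_packets
    flagged: List[Tuple[str, float]] = []
    for path, count in fail_counts.items():
        observed = totals.get(path, 0)
        if observed == 0:
            continue  # no traffic count for this path, so no rate
        rate = count / observed
        if rate > factor * overall_rate:
            flagged.append((path, rate))
    # Highest failure rate first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    totals = {"path-a": 1_000_000, "path-b": 1_000_000, "path-c": 10_000}
    failures = ["path-c"] * 180 + ["path-a"] * 10 + ["path-b"] * 10
    print(concentrated_failures(failures, totals))
```

In this made-up data set, 200 failures out of roughly two million packets look negligible overall, yet 180 of them sit on a single lightly used path, which is the kind of pattern worth surfacing.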

Those tool providers that have aligned AIOps processes and collaboration with DevSecOps approaches provide an advantage as issues will be reported directly back to the developer, who is in the best position to resolve most application-related problems. This process becomes part of the collaboration around issue resolution and could reduce problem diagnostics and solutions to a few hours, instead of the more typical days or weeks.

Deployment models vary from provider to provider, generally according to whether the provider is a traditional vendor or a startup. With startups, tool use tends to be on-demand and the solution may actually be hosted in a public hyperscale cloud, such as AWS, Microsoft Azure, or Google Cloud. These tools are the easiest to acquire and deploy but will have some difficulty reaching back into an enterprise data center if it is part of a hybrid or multi-cloud operational domain.

The more traditional players provide on-premises AIOps tool deployments, with some offering a hybrid approach as well, either on-demand via the cloud or locally hosted in your environment. While you would think that enterprises would gravitate to on-demand, many still prefer running AIOps tools on-premises, as they do with security and governance tools. Because most AIOps tools consume log, metadata, and telemetry data, pushing all of it back to a central cloud or on-premises location becomes a problem. Newer designs allow for decentralized pre-processing of data at optimal locations, forwarding only actionable intelligence to a central system and its DR counterpart. This means that for vendors to service customers with large on-premises data centers, event sources will need to support remote collection and processing.
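The decentralized pre-processing idea can be sketched as a small function that runs near the data source and forwards a compact summary only when something looks actionable. This Python example is a simplified assumption: the Summary record, the summarize_at_edge function, the site and metric names, and the single peak threshold are all hypothetical, and real designs would apply more sophisticated filtering.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Summary:
    """Compact record forwarded to the central system."""

    site: str
    metric: str
    samples: int
    average: float
    peak: float


def summarize_at_edge(
    site: str, metric: str, readings: List[float], threshold: float
) -> Optional[Summary]:
    """Reduce raw readings locally; return a summary only when the
    peak reaches `threshold`, otherwise forward nothing."""
    if not readings:
        return None
    peak = max(readings)
    if peak < threshold:
        return None  # nothing actionable, so raw data stays local
    return Summary(site, metric, len(readings), mean(readings), peak)


if __name__ == "__main__":
    print(summarize_at_edge("dc-east", "cpu_pct", [40.0, 55.0, 97.0], 90.0))
    print(summarize_at_edge("dc-east", "cpu_pct", [40.0, 55.0, 60.0], 90.0))
```

The first call produces one small record in place of the raw readings, and the second forwards nothing at all, which is the bandwidth saving that the decentralized designs aim for.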

Integration with other tools improves year over year in the AIOps marketplace. This results in more comprehensive ways for operations systems to work with security, governance, and DevOps tools. For example, a security tool would be aware that CPU or I/O saturation is occurring, which could mean a potential breach attempt. It also enables consuming data from various sources, including the network, storage, and hosts (VM, physical servers, containers, cloud IaaS/PaaS, or on-premises), for various uses, such as end-user monitoring (observed or synthetic or both) and deep application metrics for mainframe, JEE, and complex packaged apps like CRM, ERP, BPM, and data warehousing packages.

The ability to do automated remediation is critical to the success of an AIOps tool. Of course, it’s useful if a tool can find issues and even collaborate with humans. However, the ultimate destination is clearly the automated resolution of problems and a move closer to a “no-ops” model where human involvement is minimized, so supporting automated remediation in a scalable and reliable way is critical.

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

Solution Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

