1. Summary
Artificial intelligence for IT operations (AIOps) encompasses the technologies that automate, identify, and resolve IT issues. Additionally, it can predict and automatically resolve concerns before they become problems.
The infrastructure, services, and applications in an enterprise produce different types of data, including metrics, performance data, and log data. As a key component in the development of operational and organizational awareness, AIOps combines data and information from across the enterprise to improve root cause analysis, prediction, and automatic response, reducing mean time to resolution (MTTR) and producing better enterprise-level outcomes. Moreover, the integration of business intelligence (BI) data enables AIOps to answer questions about the state of the entire business.
By providing a link between IT operations and business operations, an AIOps solution helps answer questions about the state of the entire organization (Figure 1).
Figure 1. Organizational Awareness is Built on Operational Awareness and Business Awareness
Organizational awareness requires the monitoring and observability of IT systems and operations. The relationships among monitoring, observability, and awareness (MOA) are shown in Figure 2. MOA refers to a process by which data about operations (IT) and business (people and processes) is tracked and evaluated to enable a company to develop organizational awareness.
- Monitoring provides the state of a single system and metrics about it (performance or a break/fix condition).
- Observability combines the state of multiple systems and asks additional questions about the health of these systems as a whole (such as why devices, systems, or applications are behaving a certain way within the context of IT).
- Awareness brings together all information about the company to evaluate whether operations (IT) and business (people and processes) are OK, what is likely to break next, and how to prevent problems before they are monitored or observed.
AIOps solutions perform MOA functions—either internally or by ingestion of data from other tools—and as such, this becomes a critical consideration when evaluating AIOps solutions.
Figure 2. Monitoring, Observability, and Awareness Relationship
Vendor selection should be based on the needs of the organization, and the process involves more than just evaluating the technical capabilities of solutions. AIOps can be disruptive to organizations, and if the IT operations team lacks the technical and functional ability to move with changing business strategies, the political capital that may be required for successful implementation of an AIOps solution can be substantial.
With domain-specific tools, existing IT tools for monitoring and observability may have to be replaced or else the organization may have to live with redundant data gathering or analysis without retiring the displaced tools. However, if the displaced tools are retired, a single vendor can provide a solution that spans from monitoring to awareness.
In contrast, domain-agnostic solutions can be layered over existing tools, do not make them redundant, and require less political expenditure. The challenge then becomes obtaining data from the various silos within the enterprise. Friction may be lower, but the time to value may be longer.
This GigaOm Radar report highlights key AIOps vendors and equips IT decision-makers with the information needed to select the best fit for their business and use case requirements. In the accompanying GigaOm report, “Key Criteria for Evaluating AIOps Solutions,” we describe in more detail the capabilities and metrics that are used to evaluate vendors in this market.
How to Read this Report
This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:
Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.
GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.
2. Market Categories and Deployment Types
To better understand the market and vendor positioning (Table 1), we assess how well AIOps solutions are positioned to serve specific market segments and deployment models.
For this report, we recognize the following market segments:
- Small-to-medium business (SMB): In this category, we assess solutions on their ability to meet the needs of organizations ranging from small businesses to medium-sized companies. Also assessed are departmental use cases in large enterprises where ease of use and deployment are more important than extensive management functionality, data mobility, and feature set.
- Large enterprise: Here, offerings are assessed on their ability to support large and business-critical projects. Optimal solutions in this category have a strong focus on flexibility, performance, data services, and features to improve security and data protection. Scalability is another big differentiator, as is the ability to deploy the same service in different environments.
- Managed service provider (MSP): MSPs are enablers that remotely manage a customer’s network operations and deal with maintenance, upgrades, and other day-to-day activities. Their needs may align with those in the above categories, and solutions are assessed on ability to meet them.
In addition, we recognize three deployment models for solutions in this report:
- SaaS: These solutions are available only in the cloud. Often designed, deployed, and managed by the service provider, they are available only from that specific provider. The big advantage of this type of solution is the integration with other services offered by the cloud service provider (functions, for example) and its simplicity.
- On-premises: The solution can be deployed entirely on-premises to meet requirements for security and privacy not available with SaaS solutions.
- Hybrid and multicloud: These solutions are meant to be installed both on-premises and in the cloud, allowing customers to build hybrid or multicloud infrastructures. Integration with a single cloud provider could be limited compared to a SaaS solution, and these solutions can be more complex to deploy and manage. On the other hand, multicloud deployments are more flexible, and the user usually has more control over the entire stack with respect to resource allocation and tuning.
Table 1. Vendor Positioning: Market Segment and Deployment Model
Vendor | Market Segment: SMB | Market Segment: Large Enterprise | Market Segment: MSP | Deployment Model: SaaS | Deployment Model: On-Premises | Deployment Model: Hybrid & Multicloud |
---|---|---|---|---|---|---|
BigPanda | ||||||
BMC | ||||||
Broadcom | ||||||
Centerity | ||||||
Cisco | ||||||
CloudFabrix | ||||||
Datadog | ||||||
Digitate | ||||||
Dynatrace | ||||||
Elastic | ||||||
Evolven | ||||||
HCLSoftware | ||||||
IBM | ||||||
Interlink | ||||||
ITRS Group | ||||||
Logic Monitor | ||||||
Logz.io | ||||||
meshIQ | ||||||
Moogsoft | ||||||
Netreo | ||||||
New Relic | ||||||
OpenText | ||||||
OpsRamp | ||||||
PagerDuty | ||||||
ScienceLogic | ||||||
ServiceNow | ||||||
Splunk | ||||||
Sumo Logic | ||||||
Zenoss | ||||||
ZIF |
Exceptional: Outstanding focus and execution
Capable: Good but with room for improvement
Limited: Lacking in execution and use cases
Not applicable or absent
For this evaluation, we mainly looked at offerings in a binary way, rating vendors (++) if they support that market segment and deployment model and (-) if they do not.
However, there are a few exceptions in which vendors received a (+), indicating that while a solution can technically serve that market segment or deployment model, it’s not recommended due to limitations that are discussed in greater detail in the capsules.
3. Key Criteria Comparison
Building on the findings from the GigaOm report “Key Criteria for Evaluating AIOps Solutions,” Tables 2, 3, and 4 summarize how well each vendor included in this research performs in the areas we consider differentiating and critical for the sector.
- Key criteria differentiate solutions based on features and capabilities, outlining the primary criteria to be considered when evaluating an AIOps solution.
- Evaluation metrics provide insight into the non-functional requirements that many organizations have.
- Emerging technologies show how well each vendor takes advantage of technologies that are not yet mainstream but are expected to become more widespread and compelling within the next 12 to 18 months.
The objective is to give the reader a snapshot of the technical capabilities of available solutions, define the perimeter of the market landscape, and gauge the potential impact on the business.
Table 2. Key Criteria Comparison
Vendor | Data Collection & Analysis | Automated Remediation | Customizable Dashboards | Predictive Capabilities | BI Integration | Multicloud Support | Service-Level Management | Task & Workflow Management | Value Metrics |
---|---|---|---|---|---|---|---|---|---|
BigPanda | |||||||||
BMC | |||||||||
Broadcom | |||||||||
Centerity | |||||||||
Cisco | |||||||||
CloudFabrix | |||||||||
Datadog | |||||||||
Digitate | |||||||||
Dynatrace | |||||||||
Elastic | |||||||||
Evolven | |||||||||
HCLSoftware | |||||||||
IBM | |||||||||
Interlink | |||||||||
ITRS Group | |||||||||
Logic Monitor | |||||||||
Logz.io | |||||||||
meshIQ | |||||||||
Moogsoft | |||||||||
Netreo | |||||||||
New Relic | |||||||||
OpenText | |||||||||
OpsRamp | |||||||||
PagerDuty | |||||||||
ScienceLogic | |||||||||
ServiceNow | |||||||||
Splunk | |||||||||
Sumo Logic | |||||||||
Zenoss | |||||||||
ZIF |
Exceptional: Outstanding focus and execution
Capable: Good but with room for improvement
Limited: Lacking in execution and use cases
Not applicable or absent
Table 3. Evaluation Metrics Comparison
Vendor | Cost & Value | Ease of Use | Scalability | Manageability & Maintainability |
---|---|---|---|---|
BigPanda | ||||
BMC | ||||
Broadcom | ||||
Centerity | ||||
Cisco | ||||
CloudFabrix | ||||
Datadog | ||||
Digitate | ||||
Dynatrace | ||||
Elastic | ||||
Evolven | ||||
HCLSoftware | ||||
IBM | ||||
Interlink | ||||
ITRS Group | ||||
Logic Monitor | ||||
Logz.io | ||||
meshIQ | ||||
Moogsoft | ||||
Netreo | ||||
New Relic | ||||
OpenText | ||||
OpsRamp | ||||
PagerDuty | ||||
ScienceLogic | ||||
ServiceNow | ||||
Splunk | ||||
Sumo Logic | ||||
Zenoss | ||||
ZIF |
Exceptional: Outstanding focus and execution
Capable: Good but with room for improvement
Limited: Lacking in execution and use cases
Not applicable or absent
Table 4. Emerging Technologies Comparison
Vendor | AutoML | Cybersecurity Technologies | Explainable AI | AI Bias Management |
---|---|---|---|---|
BigPanda | ||||
BMC | ||||
Broadcom | ||||
Centerity | ||||
Cisco | ||||
CloudFabrix | ||||
Datadog | ||||
Digitate | ||||
Dynatrace | ||||
Elastic | ||||
Evolven | ||||
HCLSoftware | ||||
IBM | ||||
Interlink | ||||
ITRS Group | ||||
Logic Monitor | ||||
Logz.io | ||||
meshIQ | ||||
Moogsoft | ||||
Netreo | ||||
New Relic | ||||
OpenText | ||||
OpsRamp | ||||
PagerDuty | ||||
ScienceLogic | ||||
ServiceNow | ||||
Splunk | ||||
Sumo Logic | ||||
Zenoss | ||||
ZIF |
Exceptional: Outstanding focus and execution
Capable: Good but with room for improvement
Limited: Lacking in execution and use cases
Not applicable or absent
By combining the information provided in the tables above, the reader can develop a clear understanding of the technical solutions available in the market.
4. GigaOm Radar
This report synthesizes the analysis of key criteria and their impact on evaluation metrics to inform the GigaOm Radar graphic in Figure 1. The resulting chart is a forward-looking perspective on all the vendors in this report based on their products’ technical capabilities and feature sets.
The GigaOm Radar plots vendor solutions across a series of concentric rings, with those set closer to the center judged to be of higher overall value. The chart characterizes each vendor on two axes—balancing Maturity versus Innovation and Feature Play versus Platform Play—while providing an arrow that projects each solution’s evolution over the coming 12 to 18 months.
Figure 1. GigaOm Radar for AIOps
As you can see in the Radar chart in Figure 1, the upper right Maturity/Platform Play quadrant is especially crowded, and many of the vendors are within the Leaders circle. The market for AIOps has become more congested as major players in application performance monitoring (APM), IT service management (ITSM), infrastructure monitoring and observability, and other disciplines add ML and AI to their solution sets. Though many can perform as domain-agnostic AIOps solutions, additional benefits come from using the entire platform. In some cases, particularly for those with an APM background, replacement of existing tools is mandatory. Cisco (AppDynamics), Dynatrace, and New Relic began as APM tools. Large ecosystem players like BMC and IBM have an abundance of tools that span the range from mainframes to cloud computing.
The upper left Maturity/Feature Play quadrant includes vendors that are more domain-agnostic or are open source solutions. Tool displacement is not required, though some solutions, such as BigPanda, PagerDuty, and Zenoss, can provide additional features when tools are displaced. Moogsoft, with its strong AI features, continues to be an excellent, totally agnostic AIOps solution. Open source solutions, such as Elastic or Logz.io, can be highly flexible for enterprises with the requisite technology skills and the correct support structure from the vendor.
Vendors in the Innovation/Feature Play quadrant have distinctive capabilities but a more narrow focus. Solutions such as Netreo have auto-remediation and interesting AI capabilities that stand out from the market as a whole. Evolven, like meshIQ in the lower right Innovation/Platform Play quadrant, presents unique capabilities that may not fit totally under the AIOps umbrella. Evolven concentrates on configuration changes down to the parameter level anywhere within the enterprise, giving visibility to unseen changes from DevOps, infrastructure management, or any parameter or setting anywhere. CloudFabrix remains in this quadrant because it continues to improve its solution, and has become a Leader owing to its data fabric, bots, and pipeline architecture.
Vendors in the Innovation/Platform Play quadrant have distinctive capabilities and are more platform oriented. The standout here is meshIQ, whose concentration on message queues continues to make it important to transaction-based enterprises. Though the company has added more ML features, it is the ability to see inside queuing systems that makes meshIQ unique. Other players can show the inputs and outputs on either side, but meshIQ provides the internal insight often needed in financial and other transactions.
Inside the GigaOm Radar
The GigaOm Radar weighs each vendor’s execution, roadmap, and ability to innovate to plot solutions along two axes, each set as opposing pairs. On the Y axis, Maturity recognizes solution stability, strength of ecosystem, and a conservative stance, while Innovation highlights technical innovation and a more aggressive approach. On the X axis, Feature Play connotes a narrow focus on niche or cutting-edge functionality, while Platform Play displays a broader platform focus and commitment to a comprehensive feature set.
The closer to center a solution sits, the better its execution and value, with top performers occupying the inner Leaders circle. The centermost circle is almost always empty, reserved for highly mature and consolidated markets that lack space for further innovation.
The GigaOm Radar offers a forward-looking assessment, plotting the current and projected position of each solution over a 12- to 18-month window. Arrows indicate travel based on strategy and pace of innovation, with vendors designated as Forward Movers, Fast Movers, or Outperformers based on their rate of progression.
Note that the Radar excludes vendor market share as a metric. The focus is on forward-looking analysis that emphasizes the value of innovation and differentiation over incumbent market position.
5. Vendor Insights
BigPanda AIOps
Founded in 2012, BigPanda serves hundreds of customers worldwide with its AIOps solution. Its target market is medium to large enterprises and MSPs. BigPanda is a multitenant SaaS product with multiple instances that support both US and EU deployment to satisfy GDPR requirements.
BigPanda AIOps provides a scalable data ingestion framework that ingests data from more than 50 sources—including OpenTelemetry—out of the box. The solution also connects with various monitoring and data sources using REST APIs, SNMP agents, and an email parser. The solution integrates with and supports multiple public clouds and internal private clouds.
BigPanda’s dashboards can display custom views of any data ingested by the solution, including logs, metrics, traces, and events; dashboards can be customized for support groups and other organizational entities. Alerts are enriched with technical information from runbooks, DevOps CI/CD pipelines, tags, and additional business context information. Users can easily create, customize, and save dashboards.
BigPanda’s Unified Analytics delivers self-service analytics and dashboards to create custom KPIs based on IT operations data. It also delivers a library of value use-case dashboards that can be modified to reflect each organization’s MTTx metrics and KPI data.
Though BigPanda does not provide a dedicated interface for managing service levels directly, preconfigured dashboards can be customized to facilitate service-level management.
Depending on the data available, BigPanda can predict probable root cause changes to the environment, which are shown as pending incidents. Seasonality predictions are not possible.
BigPanda offers several integration options for BI tools, including REST APIs, webhooks, and connectors to popular BI platforms such as Tableau and Power BI.
BigPanda integrates with a wide range of task and workflow management and orchestration tools, such as Ansible, Puppet, and Chef. BigPanda’s task and workflow management capabilities include a centralized incident management interface, automated workflows to reduce the need for manual intervention, and integration with third-party tools such as Jira, ServiceNow, and Slack.
To facilitate automated remediation, BigPanda offers bi-directional task and workflow integrations with external tools to reduce manual intervention and automate workflows. Runbooks can be used to support automated remediation.
AutoML is not available. The platform has a library of suggested AI correlation patterns used across all BigPanda customers based on customer usage. The popularity and expected impact of each pattern are displayed along with details on why the pattern is suggested.
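To make the idea of a correlation pattern concrete, the following is a minimal sketch of one common approach: grouping alerts that share tag values and arrive close together in time. It is not BigPanda's algorithm. The `Alert` fields, the `correlate` function, the `service` tag, and the 10-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    """A hypothetical normalized alert; field names are illustrative only."""
    source: str
    timestamp: datetime
    tags: dict


def correlate(alerts, tag_keys=("service",), window=timedelta(minutes=10)):
    """Group alerts that share the given tag values and arrive within
    `window` of the previous alert in the same group."""
    incidents = []
    latest_by_key = {}  # tag-value tuple -> most recent incident for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = tuple(alert.tags.get(k) for k in tag_keys)
        current = latest_by_key.get(key)
        if current is not None and alert.timestamp - current[-1].timestamp <= window:
            # Same tag values and close in time: fold into the open incident.
            current.append(alert)
        else:
            # Different tag values or too long a gap: start a new incident.
            current = [alert]
            incidents.append(current)
            latest_by_key[key] = current
    return incidents
```

In this sketch, a pattern is simply the pair (`tag_keys`, `window`). Two alerts tagged with the same service three minutes apart would be merged into one incident, while a third alert for that service half an hour later would open a new one.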
BigPanda is not a dedicated cybersecurity solution and may require additional security technologies to fully protect the enterprise.
BigPanda calls its explainable AI Pragmatic AI. The logic of the ML suggestions is explained in plain English. Users add situational and tribal knowledge to the logic to strengthen it on their own without requiring expert data scientists.
There is no direct support for AI bias management; however, the Pragmatic AI provides some insight into new correlation patterns based on new event streams and predicting probable root cause changes.
BigPanda offers two licensing models: subscription or consumption-based pricing. For subscriptions, pricing is based on the number of devices, applications, and services monitored and managed by the platform. Consumption-based pricing is based on the number of alerts ingested and acted upon, allowing customers to start small and grow over time.
The solution is available in three editions: Alert Intelligence, Incident Intelligence, and Workflow Automation. BigPanda also offers a free trial. There is no specific need for tool displacement, though BigPanda also has monitoring and observation solutions. Professional services are available but not required for implementation.
BigPanda is easy to implement and use in the SaaS deployment. There are self-guided learning modules within BigPanda University as well as an in-app onboarding experience. These help to simplify deployment of the large number of possible configurations.
For the SaaS version, BigPanda handles scaling, management, and business continuity and disaster recovery (BC/DR). For the self-managed version, organizations will have to handle these aspects themselves. Professional services are available from BigPanda to help self-managed deployments scale.
Strengths: BigPanda is a domain-agnostic AIOps solution with good data ingestion capabilities, including support for OpenTelemetry. Onboarding wizards can simplify deployment. Dashboards are also strong, as is multicloud support for public and private clouds.
Challenges: Predictive analytics are limited to pending failures. Direct service-level management is not supported; however, MTTx and KPIs can be defined using out-of-the-box dashboards.
BMC Helix AIOps
BMC Software is a global leader in enterprise software solutions. The company was founded in 1980 to manage mainframe computers. Today, BMC serves thousands of customers worldwide, including many Fortune 500 companies.
BMC Helix AIOps can be deployed in a variety of ways: on-premises, self-managed deployments; cloud (SaaS) deployments for BMC-managed environments; and hybrid deployments where organizations manage the on-premises environment and BMC manages the SaaS environment. The solution is also available for MSPs.
BMC Helix AIOps can ingest data from any source, including OpenTelemetry, without requiring programming skills. It can handle various data types such as logs, metrics, traces, topology, and events.
The solution offers automated remediation capabilities and integrates with existing workflow management and orchestration tools.
It provides advanced data visualization capabilities and customizable dashboards for near real-time insights that are easy to understand and act on. Data can be added to a dashboard easily after drilling into an area. The user can see all data correlated with the subject under examination.
BMC Helix AIOps uses ML algorithms to predict potential issues and failures, and it can forecast yearly trending and predict both future saturation and optimization.
The solution supports data from multicloud environments and provides visibility into all environments through a single screen. It offers capabilities to monitor and manage service-level agreements (SLAs), service-level objectives (SLOs), and service-level indicators (SLIs).
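As a general illustration of how these three terms relate (not a description of BMC's implementation), an SLI is a measured value, an SLO is the internal target for that value, and an SLA is the contractual commitment built on top of it. The sketch below computes an availability SLI and the share of the error budget that remains against an SLO. The function names and the sample figures are assumptions for illustration.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that met the success criterion."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached).

    The error budget is the allowed failure rate, 1 - slo_target.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = 1.0 - slo_target   # failure rate the SLO tolerates
    spent = 1.0 - sli            # failure rate actually observed
    return (allowed - spent) / allowed
```

For example, 99,950 successful requests out of 100,000 give an SLI of 99.95%. Against a 99.9% SLO, roughly half of the error budget remains. A negative result signals that the objective has been missed for the period.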
BMC Helix facilitates collaboration and streamlines workflows between different IT teams, providing the ability to assign and track tasks, share data and insights, and communicate in real-time.
The solution captures MTTx value metrics and lets users define custom metrics via a low/no-code environment.
With regard to emerging technologies, BMC Helix AIOps offers AutoML leveraging knowledge graphs and generative AI (BMC HelixGPT) to automate the process of building and training ML models to improve root-cause isolation and help with prediction. It uses AutoML to reduce the complexity of creating models and speed up the time-to-value for customers. In the next 18 to 36 months, the company plans to enhance the AutoML capabilities to improve accuracy and scalability for even larger datasets.
BMC Helix AIOps leverages cybersecurity technologies such as threat detection, identity and access management, and security information and event management (SIEM) to help customers remain secure. Its roadmap indicates the company plans to continue incorporating and enhancing cybersecurity technologies in its product.
BMC Helix AIOps currently uses explainable AI to assist IT teams in understanding and trusting the decisions made by AI systems. In the future, it plans to further enhance its explainable AI capabilities to provide even more transparency into the decisions.
BMC has a dedicated team focused on ensuring its models are unbiased and align with the company’s ethical principles. Enhancements are planned in future releases.
BMC Helix AIOps is licensed on a per-node basis, with pricing varying based on the number of nodes and the length of the contract. Professional services may be required for initial deployment, depending on the complexity of the environment and the level of customization needed. The level of tool displacement required for deployment will depend on the specific use case and the existing tool landscape; however, the best value for many enterprises would be to use BMC for all tooling.
BMC Helix AIOps is designed to be scalable, both for SaaS and on-premises deployments. For on-premises deployments, BMC provides support to help choose the correct hardware for both the initial deployment and future expansion. The default data retention period for BMC Helix AIOps varies depending on the contract and can be extended at an additional cost.
For SaaS-only deployments, BMC Helix AIOps includes a BC/DR plan to protect against failure of a primary instance. For on-premises deployments, BC/DR is supported through a variety of methods, including backups and failover mechanisms.
Strengths: BMC has strong data ingestion capabilities and customizable dashboards. Service-level management and value metrics are better than average. Unlike most of the other vendors, it supports emerging technologies.
Challenges: BMC is a large platform player, which can be both a strength and a weakness. Displacement of tools is not a requirement but is recommended due to the breadth of solutions offered by BMC.
Broadcom DX Operational Intelligence
Broadcom is a global technology provider offering a broad portfolio of solutions spanning semiconductor and enterprise software segments.
Broadcom’s AIOps tool, DX Operational Intelligence, supports SMBs, large enterprises, and MSPs. It can be deployed on-premises, as a SaaS solution, or as a hybrid of the two. It also supports the use of public clouds for deployment using Helm (a package manager for Kubernetes). It can be deployed as a domain-agnostic AIOps tool.
Broadcom supports ingesting, normalizing, and correlating IT operations’ observability data from mainframe to public clouds and customer experience technologies.
Broadcom has good auto-remediation capabilities and can support multiple orchestration platforms. The solution leverages standards through dashboards built on natively embedded Grafana, including custom plug-ins and log data integration with OpenSearch (an open source fork of Elasticsearch). The predictive analytics capabilities include seasonality for capacity management.
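Seasonality in capacity forecasting can be illustrated with a seasonal-naive baseline, which predicts each future point from the average of past points at the same position in the cycle. This is a generic sketch of the concept, not Broadcom's method. The function names and the sample data are assumptions for illustration.

```python
from statistics import mean


def seasonal_profile(history, period):
    """Average the history by position within the season
    (for example, period=7 for a weekly cycle on daily data)."""
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full season of history")
    return [mean(history[i::period]) for i in range(period)]


def forecast(history, period, steps):
    """Seasonal-naive forecast: repeat the seasonal profile,
    continuing from the position where the history ends."""
    profile = seasonal_profile(history, period)
    start = len(history)
    return [profile[(start + k) % period] for k in range(steps)]


def steps_over_capacity(history, period, steps, capacity):
    """Indices of forecast steps whose predicted load exceeds capacity."""
    return [k for k, v in enumerate(forecast(history, period, steps)) if v > capacity]
```

With a history of `[10, 20, 30, 12, 22, 32]` and a period of 3, the seasonal profile is `[11, 21, 31]`, so a capacity limit of 30 would be flagged at every third step. Production tools typically add trend and confidence intervals on top of a baseline like this.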
Service-level management is included, allowing a better connection with desired business outcomes. Broadcom has a workflow engine within DX and can connect with external tools for ITSM and other services. Value metrics are provided out of the box and can be reconfigured if needed. BI data can be included using APIs.
Ease of use is above average, though with some weakness in documentation, which the current roadmap indicates will be improved. The ability to scale and maintain the solution is average to above average. The SaaS solution is very easy to maintain. The on-premises and self-managed versions are well supported, including professional services.
Scaling the SaaS version is handled by Broadcom, resulting in a lower load on the enterprise. Self-managed deployments will have to deal with hardware and software considerations, though Broadcom can assist. The design of a fault-tolerant self-managed solution is likely to need Broadcom’s professional services. BC/DR is handled by Broadcom for SaaS deployments. Self-managed users may require assistance.
Manageability of the SaaS solution is good, with Broadcom handling most software and hardware. Agents and collection software require some enterprise intervention but no more than with other AIOps vendors.
DX Operational Intelligence has nascent support for AutoML, cybersecurity technologies, explainable AI, and AI bias management. These emerging technologies are on the roadmap for Broadcom, although they are not currently marketable features.
Though AI bias management is a problem known to Broadcom, the company notes that the nature of AIOps makes bias less likely. However, bias management is in its future plans.
DX Operational Intelligence is licensed by normalized “device,” for which the count is based on the monitored entities ingested from different sources. A license manager is provided to assist in controlling costs. Professional services are available and recommended for large enterprises or when multiple Broadcom products are introduced to the environment. Tool displacement is not mandatory, as DX Operational Intelligence can consume data from anywhere. However, the greatest value from Broadcom comes from the integration of all monitoring, alerting, workflow, and other tools that are part of the Broadcom family of solutions.
Strengths: Broadcom provides a strong AIOps solution backed by a platform of tools that can improve any enterprise. It provides strengths in service-level management, workflow and task management, and value metrics. The SaaS solution is easy to scale.
Challenges: Broadcom is a platform solution that may challenge many enterprises not already using Broadcom tools. The cost and value can be difficult to determine.
Centerity CSM²
Centerity was founded in 2007 and, in 2015, acquired ActiveBase, a provider of database performance management solutions. Today it’s a global provider of AI-powered IT operations, analytics, and performance-monitoring solutions and serves a wide range of industries, including finance, healthcare, and telecommunications.
Centerity CSM² can be deployed as a physical appliance installed on a bare-metal server. In its containerized form, it can be installed using virtual machines (VMs) or in a public or private cloud. Although suitable for SMBs or large organizations, it is most often made available via MSPs, which may offer the solution as a platform as a service (PaaS).
Data collection and analysis are good, with support for edge devices such as point-of-service terminals, ATMs, self-service kiosks, self-check-out machines, generators, 5G edge computing, security cameras, gas pumps, and other internet of things (IoT) devices. All data collection is executed by Centerity collectors and sensors with support for a wide range of protocols. There is currently no support for non-Centerity sources or OpenTelemetry.
Automated remediation and proactive operations, including rebooting elements or triggering scripts, are key features. Support for workflow management and orchestration is available by integrating with Ansible and other engines. Dashboards can be customized and support integration with ITSM systems. Broader AI predictive capabilities are on the roadmap, although device failure predictions are already supported.
To integrate BI systems, the Centerity professional services team must become involved. Centerity supports multicloud environments with an agent or via an API to individually collect data from every cloud and then organize it into logical containers called dynamic service views (DSVs), which may include host groups and profiles that simplify the interpretation of complex environments and allow working at a service level. DSVs can drill down to gain complete visibility of each individual environment and element.
SLAs are expressed as KPIs within DSVs. SLAs can be represented as one of the layers of the DSV, and SLIs can be represented by the metrics integrated in such layers. Only data collected by a Centerity agent can be consumed.
Centerity does not directly support task and workflow management. It relies on the incident management system, where the integration allows only the creation of incidents. Workflow management is included in the current product roadmap.
Centerity supports standard MTTx metrics. Combination metrics such as mean time between critical failures (MTBCF) can be created using Centerity’s low-code connector. Centerity provides basic reporting capabilities over past open/closed incidents.
Centerity is licensed under an annual subscription model. Its capacity-based pricing is based on the number of endpoints or a more granular “per monitored metric” methodology. Professional services are essential for successful implementation. Silo-specific monitoring tools should be replaced by Centerity because the product supports data consumption only via its own agents and collectors. However, Centerity does work in conjunction with APM and incident management tools.
The onboarding process requires a two-day workshop for administrators provided by the professional services team, which increases the implementation time and maintenance efforts when administration duties change.
Dashboards are easily customizable, and Centerity supports third-party tools (Grafana) for additional features. Dashboards can be shared and changed on the fly. The environment is mostly low-code.
Training model values are established using unsupervised ML with an analytics model to support AutoML. AutoML is better than average, and Centerity’s roadmap hints at significant improvements in the future.
Cybersecurity is provided via Centerity’s business partners with hardware access control (HAC-1), inventory, and security capabilities. HAC-1 uses physical layer fingerprinting technology and ML to calculate a digital fingerprint of each device and compares it against known fingerprints. The cybersecurity support requires additional vendors or partners at added cost. The current Centerity offering has no support for explainable AI or AI bias management. Future iterations may address these features; however, there is no clear direction from Centerity.
Centerity is a fully containerized solution and scales horizontally, either on-premises or via SaaS providers (MSPs). It provides BC/DR via horizontal scalability. For on-premises implementations, redundancy and a quorum can be implemented in multiple-node architectures. All configuration items are stored in the central repository for redundancy. The update/upgrade process and patching require downtime. Automatic deployment of agents is on the roadmap for Q2 2023. Scaling is average, but support for self-managed deployment is good.
Strengths: Automated remediation is a strong feature for Centerity, and the solution scales well for both SaaS and on-premises deployments.
Challenges: Tool displacement is a concern, as Centerity consumes data only from its own collectors and agents. There is no built-in support for OpenTelemetry.
Cisco AIOps
Founded in 1984, Cisco is a global technology company that designs and manufactures networking equipment, software, and services. The company has also grown through strategic acquisitions, including the purchase of companies such as AppDynamics, Webex, Meraki, and Duo Security.
Cisco AIOps is targeted at large and medium-sized enterprises. The solution can be deployed on-premises, in the cloud (SaaS), or as a hybrid model. Once deployed, Cisco AIOps can be integrated with existing infrastructure and tools using APIs and connectors.
Cisco AIOps enables data collection and analysis from a wide range of sources, including OpenTelemetry, without requiring programming skills. The solution handles logs, metrics, traces, events, and other relevant types of data. Cisco has built-in auto-remediation capabilities, and it integrates with existing automation management and orchestration tools via APIs. Its advanced data visualization capabilities include the ability to create custom dashboards, charts, and graphs.
The solution uses ML algorithms to predict potential issues and failures before they occur, with a prediction range of weeks, months, or yearly trending depending on the data available. It integrates with a wide range of BI sources and platforms.
Multicloud support is available for data from any cloud environment, including public and private clouds. Visibility into all environments is available through a single screen—without the need to build custom dashboards and reports.
Cisco AIOps monitors and manages SLAs, SLOs, and SLIs. It captures MTTx metrics that can provide information about past performance and situations when failures are likely to occur. The solution also provides metrics to help enterprises determine how well the AIOps solution is providing value. Enterprises can define their own metrics via a low/no-code environment.
Cisco currently provides AutoML capabilities within the Cisco AIOps solution, which it uses to automate the process of building and training ML models. Cisco plans to improve its AutoML capabilities during the next 18 to 36 months.
The Cisco AIOps solution leverages cybersecurity technologies such as threat detection, identity and access management, and SIEM by ingesting data from other tools. It evaluates new cybersecurity technologies and will incorporate them into the product as they become available.
The company recognizes the importance of explainable AI and plans to incorporate this capability in a future version of AIOps. In addition, Cisco has a dedicated team that is responsible for identifying, tracking, and managing bias in their product, though bias management is not available to customers.
Cisco AIOps is licensed via a subscription model, and professional services are available for deployment. The company provides tools that may overlap with existing solutions, but existing tools with overlapping functions can coexist with it, and minimal tool displacement is necessary for deployment. However, application performance management must originate from Cisco.
Cisco AIOps does not require extensive training to be useful. A low-code/no-code approach enables users to customize the solution, including dashboards and workflows, with minimal effort.
Moreover, the onboarding and setup processes require minimal technical expertise, and ingestion of a new data stream is easy to implement in Cisco’s low/no-code environment.
The solution is scalable in both SaaS and on-premises deployments, and Cisco can help on-premises deployments choose the correct hardware for both the initial deployment and for future expansion. For SaaS solutions, the default data retention period is ninety days, and it can be extended at an additional cost. Cisco supports common operational methods and best practices, and it includes a BC/DR plan for SaaS-only deployments. The solution is manageable and maintainable, and offers regular updates and support.
Strengths: Data collection is good, with support for OpenTelemetry. Auto remediation abilities are above average, and the solution provides multicloud support. Service-level management is included along with value metrics, scalability, and manageability.
Challenges: Predictive capabilities could be stronger, and BI integration is available only via APIs, with no out-of-the-box support. Explainable AI and AI bias management by customers are not yet available.
CloudFabrix Data-Centric AIOps Platform
Founded in 2013, CloudFabrix is a provider of AIOps and digital transformation solutions. In 2017, the company launched its flagship AIOps platform.
CloudFabrix’s data-centric AIOps platform combines observability, AIOps, and automation within a low-code environment. It is well suited for SMBs, large enterprises, and MSPs.
Data collection is robust and includes OpenTelemetry and CloudFabrix’s RDA bots and data observability pipelines. The platform uses these low-code/no-code bots and pipelines to process, prepare, and transform the raw data into payloads that are tightly integrated around ITSM, collaboration, ChatOps, and workflow automation. AutoML is used in robotic data collection and pipeline creation.
The solution is built with “composability” as one of its foundational principles, allowing users to create their own dashboards. Dashboards are built using a drag-and-drop dashboard builder as well as JSON-based definitions. Reference dashboard templates are available for prototyping and iterative development.
CloudFabrix has predictive analytics capabilities built into the platform. As the data is ingested and collected, time series regression models are trained, which can help in performing trending, seasonality, anomaly detection, forecasting and prediction, and seven-day KPI forecasting. Models can be (re)trained on a periodic basis or if drift is detected. With enough data, the solution can predict a year into the future. For advanced scenarios, it allows complete customization of the ML model, the use of a different model, or customizing the model with hyperparameter tuning via the low-code ML pipeline.
Out-of-the-box integrations include Microsoft Power BI, Tableau, Kibana, Grafana, and Amazon QuickSight. There is also integration with Cisco AppDynamics Business iQ. Integrating with new BI tools is accomplished using low-code REST API bots and data transformation systems. Data payloads can be constructed and sent to the BI tool to populate analytics and dashboards.
The solution is architecturally built to support multicloud environments and hybrid cloud environments, including private-cloud remote/branch/edge environments. The platform provides single-pane visibility into multicloud data.
CloudFabrix offers SLO management capabilities by defining SLOs via policies and/or thresholds, and predictive alerting criteria can be defined to identify potential performance issues. The data can also be integrated with third-party BI tools.
CloudFabrix provides collaboration and workflow management as part of event correlation and incident management workflows. The solution does not depend on integrations with other workflow tools and can handle task and workflow management natively.
CloudFabrix’s data-centric AIOps platform captures key MTTx value metrics to provide insights about performance, including forecasting when failures will occur and tracking how well such failures are detected and addressed. Metrics and KPIs can also be defined using low-code/no-code bots, and can be produced using a generative AI assistant and by querying using CloudFabrix’s CFX query language.
Cybersecurity is supported via log analysis, with additional functionality in this area on the roadmap. CloudFabrix does a good job of explaining the data used to make an analysis but does not explain how the AI uses that data to make predictions. There is no bias management capability; however, CloudFabrix has a pipeline synthesizer, which generates synthetic faults and can observe how the AI models and pipelines behave before production deployment.
Professional services are available but not required for deployment. Tool displacement is not mandatory, but the CloudFabrix data bots may provide data that’s redundant. Ease of use has improved over previous years, though some terminology may be unfamiliar (for example, composable vs. customizable dashboards).
Scaling is handled by CloudFabrix for SaaS deployments, while scaling for self-managed installations is handled by the enterprise. CloudFabrix provides good support that may eliminate the need for professional services.
The management of the solution is straightforward, but when large numbers of data bots are used, pipeline management can become complicated. The generative AI assistant can be helpful in managing the environment.
Strengths: CloudFabrix’s robotic data automation and data fabric provide a unique method of ingesting data. The solution deploys in any environment and has strong low-code/no-code capabilities throughout. OpenTelemetry data ingestion is supported. Predictive capabilities are functional. The generative AI assistant provides an additional method of querying the environment.
Challenges: Predictive capabilities, though workable, could use refinement for greater ease of use. Support for cybersecurity technologies is via integration, with nothing further on the roadmap. Explainable AI and AI bias management should be on the roadmap.
Datadog Watchdog AIOps
Datadog is a cloud monitoring and analytics platform for IT infrastructure and applications. The company was founded in 2010.
The target market for Datadog is SMBs and large enterprises. Datadog is deployed as SaaS with the ability to ingest data from on-premises environments. The solution consists of many modules, which can present challenges in the initial deployment.
Datadog includes an extensive integration library that allows data ingestion from almost any source. OpenTelemetry is supported, and the Datadog collection agent is open source. Datadog’s Event Management can ingest events from a number of third-party sources and can enrich these data.
Datadog’s AIOps solution is designed to be user-friendly and intuitive, with no extensive training required. The solution offers a low-code/no-code approach to customizing dashboards and workflows.
Datadog’s Watchdog, an intelligence feature that continuously analyzes data points related to applications and infrastructure, can perform auto-remediation using webhook integrations and monitoring APIs to trigger workflows. The low-code definition environment allows teams to build automated remediations with human-in-the-middle capabilities.
Watchdog provides support for data from multicloud environments as well as private clouds. Users have visibility into all of these environments through a single screen.
Watchdog integrates with various BI sources and platforms to provide a holistic view of business, IT performance, and actionable insights. Some out-of-the-box integrations with BI tools are available.
Watchdog captures MTTx metrics that provide information about past performance and help predict the likelihood and timing of future failures. Metrics also exist to help enterprises determine how well the AIOps solution is providing value.
Datadog provides good data visualization capabilities through its customizable dashboards and reports. Users can modify and save dashboards on the fly.
Forecasting is a Datadog feature that allows enterprises to predict where a metric is heading in the future and is especially useful for metrics with strong trends or recurring patterns. Predictive analytics is performed in a low-code environment, but knowledge of the metrics and parameters involved is necessary to achieve good results. Event prediction and event forecasting, however, are not supported.
The solution can define and manage SLAs, SLOs, and SLIs.
Workflow integration with common collaboration tools is provided out of the box. The workflow management interface uses a low-code environment to define workflows that can automate business processes.
Datadog’s AIOps solution currently does not use AutoML to automate the process of building and training ML models. The company plans to incorporate AutoML in the product in the next 18 to 36 months.
The solution leverages cybersecurity technologies such as advanced threat detection, identity and access management, and security information and event management (SIEM). These cybersecurity features may be accessed via the Datadog security module or through integration with other products. Datadog will continue to incorporate cybersecurity technologies in the product in both the near and long term.
Datadog does not offer explainable AI or enable AI bias to be managed by customers. Future versions may have these features.
Licensing is based on a subscription model with multiple tiers, starting with a free offering. The enterprise tier features more than 600 integrations and 15-month metric retention. Datadog provides professional services if needed. Note that Datadog has more than 20 modules, meaning it could displace many tools. However, the solution can coexist with existing monitoring and observability tools, minimizing tool displacement. The value of a single vendor platform for monitoring and observability is high; however, the cost, including political capital in large enterprises and the added complexity in licensing, requires analysis.
The solution is scalable, as Datadog provides all SaaS support. The default data retention period is 15 months, although it can be extended at additional cost.
Datadog’s AIOps includes a BC/DR plan for SaaS deployments. For on-premises implementations, the company offers support for creating a BC/DR plan.
Agents and data collectors are considered part of the deployment, and they are managed by Datadog for SaaS deployments. Support is provided for self-managed installations.
Strengths: Datadog is strong in data ingestion, remediation, dashboards, multicloud support, task and workflow management, and cybersecurity.
Challenges: The solution consists of many modules, which can make deployment more difficult.
Digitate ignio AIOps
Founded in 2015, Digitate is a global provider of AI and automation solutions for IT operations. It launched ignio, its AI-powered platform that enables AIOps, in 2015.
The target market includes SMBs, large enterprises, and MSPs. Deployment is SaaS only. The solution has over ten thousand out-of-the-box automation capabilities and many integrations with major IT tools.
Digitate supports push and pull integrations for incidents, events, service requests, real-time feeds, asset information, batch job details, and more. New integrations can be built using ignio Studio, a low-code environment for extending out-of-the-box capabilities.
Digitate ignio generates a dependency hierarchy for any incident and performs health diagnostics to determine probable causes. It finds root causes, decides on fixes, closes the incident, and validates closure. Auto-remediation is a strong point for Digitate.
Digitate has a number of out-of-the-box reports and dashboards. Users can create their own dashboards, but there is no low-code environment for creating and modifying dashboards. Reports are rendered in the form of grids and charts, including time-series, scatter plots, bar charts, and heat maps. Reports and dashboards have varying levels of granularity, and a business function view can summarize a function and drill down to a sub-function.
Digitate ignio uses topology and historical data to perform predictions, adapts them in real time, and provides sufficient warning in advance to prevent events. However, seasonality and load predictions are not supported. When observing delays or failures, Digitate uses back-propagation techniques to localize critical paths and Bayesian reasoning to narrow down the root cause. It uses deep learning models to examine historical learnings for a symptom, cause, or fix. Fault propagation algorithms assess the downstream impact of failure to infer criticality to act on the anomaly. Digitate ignio can predict failure three to four hours in advance, and it can also use coarse-grained predictions to make weekly, monthly, and yearly predictions. This is a standout feature for Digitate.
For BI, Digitate ignio allows the creation of customized extractors using API-based integration.
The single-pane cloud management console offers a convenient interface for managing hybrid or multicloud environments without the need to build custom dashboards and reports. Service-level management is not included; however, service levels can be calculated. Digitate does not provide task and workflow management but relies on the integrated ITSM solution for these functions.
Value metrics are available via reports covering events, alerts, noise reduction, and MTTx metrics, and ignio offers an effectiveness dashboard for AIOps. Users can create their own metrics in ignio Studio.
ignio uses AutoML to develop solution pipelines of atomic ML algorithms for pattern mining, trend analysis, clustering, forecasting, and natural language processing (NLP). These solutions select the right algorithms and models for the right data, self-tune them, and self-learn with changing behavioral properties. The solutions also adapt by learning from user feedback, using the concepts of reinforcement learning. An area of future focus is augmented intelligence, by which ML solutions can collaborate with domain experts to learn together and resolve unknown situations.
Digitate leverages web application firewalls (WAFs), SIEM, and related third-party offerings to secure its SaaS deployment. AI/ML security analysis of incoming data is on the roadmap.
Digitate ignio offers explainable AI for its forecasting models and prediction algorithms. It gives visual and textual information to explain the predictions and decisions. Going forward, Digitate plans to bring explainability to all ignio features, including anomaly detection, change-impact analysis, and optimization. Today, ignio information and explanations validate and justify the decisions made and analysis conducted by ignio for triaging and self-healing use cases.
ignio looks for AI bias by performing data quality assessment to detect any bias in the data and uses techniques such as k-fold cross-validation to minimize the impact of bias. To instill trust in users regarding bias in AI, ignio’s data quality assessment feature evaluates data on volume, velocity, variety, and veracity. It also provides visual evidence to explain ignio’s insights. Further development is planned to expose this to the enterprise.
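The k-fold cross-validation mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not a description of how ignio applies it; the function names `k_fold_indices` and `cross_validate` and the `fit`/`score` callables are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every sample lands in exactly one validation fold, so no single
    slice of the data dominates the evaluation.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds and score on the held-out fold, k times.

    fit(train_xs, train_ys) -> model and score(model, val_xs, val_ys) -> float
    are caller-supplied (hypothetical) callables.
    """
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return scores
```

A wide spread among the returned scores suggests the model is sensitive to which slice of data it was trained on, which is one signal that the data may be skewed.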
Strengths: Digitate has strong auto-remediation and prediction capabilities. It is moving forward rapidly to implement emerging technologies such as AI bias management and explainable AI.
Challenges: Cybersecurity currently depends on integration with third-party tooling. Direct management of SLAs is not available; however, service metrics can be calculated and displayed.
Dynatrace
Dynatrace is a global provider of unified observability and security software intelligence solutions. The company was founded in 2005 and went public in 2019.
The target market includes large enterprises and MSPs. SMBs could be candidates for Dynatrace but might find the size of the platform and the training requirements difficult to tackle. The solution is deployed as a SaaS platform, a self-managed solution running on a public or private cloud, or a hybrid of managed services and SaaS.
Data collection and analysis is a strong suit for Dynatrace and includes OpenTelemetry. Many integrations are available out of the box, and custom extensions can be created using the Dynatrace API.
Using the root cause analysis of its causal Davis AI engine to perform closed-loop automation via native integration or webhook, Dynatrace shows good automated remediation, with manual approval steps optional. Additional integrations with external automation tools are possible.
The better-than-average dashboards from Dynatrace are preconfigured, context-specific, customizable, and shareable for different user levels and use cases.
Predictive analytics via ML allows automatic baselining, including seasonal baselines. Dynatrace uses probability to compute the confidence band for predictions.
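To illustrate the general idea of a seasonal baseline with a confidence band, here is a minimal sketch. It is not Dynatrace's algorithm: it assumes a fixed seasonal period and roughly normal variation within each slot, and the names `seasonal_baseline` and `is_anomalous` are hypothetical.

```python
from statistics import mean, stdev


def seasonal_baseline(history, period, z=1.96):
    """Return a (lower, expected, upper) band for each slot in the seasonal cycle.

    history: metric values sampled at a fixed interval, oldest first.
    period:  samples per cycle (for example, 24 for hourly data with a daily pattern).
    z:       band half-width in standard deviations (1.96 is about 95%
             under a normal assumption).
    """
    bands = []
    for slot in range(period):
        samples = history[slot::period]  # every value observed at this point in the cycle
        mu = mean(samples)
        sigma = stdev(samples) if len(samples) > 1 else 0.0
        bands.append((mu - z * sigma, mu, mu + z * sigma))
    return bands


def is_anomalous(value, slot, bands):
    """Flag a value that falls outside the band for its seasonal slot."""
    lower, _, upper = bands[slot]
    return value < lower or value > upper
```

Because each slot gets its own band, a value that is normal at peak time can still be flagged when it appears during a quiet period.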
Dynatrace natively integrates with major cloud and virtualization vendors in addition to on-premises systems, including container platforms. BI integrations are available via open APIs and extensions, and some integrations are provided. The solution can bidirectionally integrate with third-party messaging or incident-management systems that contain task and workflow systems.
There is native support for SLOs and SLIs, an improvement over the average AIOps solution. Customers can define, measure, and monitor SLOs to automate SLO validation and quality gates.
Value metrics provide details on detected problems, including start times, end times, and related metadata. MTTD, MTTR, and similar metrics can be constructed, charted, and observed (baselining and alerting) with a low/no-code approach via Dynatrace Data Explorer and the Dynatrace Query Language.
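As a rough illustration of how MTTx metrics can be derived from problem records with start and end times, consider the sketch below. The record layout and the `detected` field are assumptions made for the example, and MTTR is measured here from problem start to resolution (definitions vary); this is not Dynatrace Query Language.

```python
from datetime import datetime, timedelta


def mean_duration(pairs):
    """Average the gap between (earlier, later) timestamp pairs."""
    gaps = [later - earlier for earlier, later in pairs]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical problem records: when each problem began, was detected, and was resolved.
problems = [
    {"started": datetime(2023, 1, 1, 10, 0),
     "detected": datetime(2023, 1, 1, 10, 5),
     "resolved": datetime(2023, 1, 1, 11, 0)},
    {"started": datetime(2023, 1, 2, 14, 0),
     "detected": datetime(2023, 1, 2, 14, 15),
     "resolved": datetime(2023, 1, 2, 14, 40)},
]

# Mean time to detect: start -> detection.
mttd = mean_duration([(p["started"], p["detected"]) for p in problems])
# Mean time to resolve: start -> resolution.
mttr = mean_duration([(p["started"], p["resolved"]) for p in problems])

print(f"MTTD: {mttd}, MTTR: {mttr}")
```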
Runtime Vulnerability Analytics (RVA) provides vulnerability detection in production environments with AI-powered risk assessment for prioritization. Runtime Attack Protection (RAP) detects and blocks attacks in real time. No additional deployment is required; Dynatrace RVA works where OneAgent is available. Security problems are raised, tracked over time, and closed automatically as soon as the vulnerability is remediated.
Dynatrace does a good job of explaining the source of the data used by the AI, how the data were used, and the conclusions drawn, but does not explain the AI itself. As this is an emerging area for AIOps solutions, fully explainable AI is in its infancy for most vendors.
Dynatrace does not rely on humans to analyze training data and rules out this aspect of bias. Instead, it uses the Davis self-monitoring dashboard to observe the operational quality of the AI models.
Dynatrace solutions are available via subscription license, and Dynatrace offers usage-based pricing across all solutions. Alternatively, customers can purchase a platform subscription on an annual commitment, and may then use any Dynatrace capability with no additional licensing or purchase required.
Dynatrace displaces tools, including other AIOps, APM, cloud and infrastructure monitoring, log management and analytics, digital experience management, application security, business analytics, and automation solutions.
The solution is generally easy to use, but its complexity can make adoption strenuous. Moreover, the training requirements may create friction during the deployment process. The use of a low-code/no-code approach does simplify the configurations of dashboards, integrations, and extensions, however. Professional services are not required, but larger complex enterprises may consider them necessary.
Dynatrace scales well and, in the case of the SaaS product, is easy to maintain via its OneAgent technology. Data can be retained as long as five years (at a cost). The company is moving all data types to Grail, its data lakehouse. Dynatrace uses autoML to optimize parameters by continuously analyzing the current data situation.
New versions of Dynatrace are rolled out every two weeks on SaaS clusters. Self-managed customers can apply monthly functional updates with a SaaS-like experience.
BC/DR are managed by Dynatrace for SaaS installations. Self-managed users have a self-contained, out-of-the-box solution that enables near-zero downtime and allows monitoring to continue without data loss in failover scenarios.
Strengths: The analytical and forecasting capabilities of Davis AI are a strength for Dynatrace, and its Grail repository technology, prediction, service-level management, scaling, manageability, and maintainability are also strong. There is some autoML with more planned. Dynatrace continues to handle the balance between complexity and ease of use well. Cybersecurity support is good for an emerging technology within AIOps.
Challenges: Dynatrace may not be well suited for SMBs due to training needs and time-to-value. Setup of this AIOps solution may have a long learning curve for large operations teams accustomed to event correlation but not full-stack analysis.
Elastic Observability
Elastic is a global software company that provides free and open solutions for search, logging, and analytics. The company was founded in 2000, and in 2014, it launched Elasticsearch, its flagship product. The target market is MSPs, SMBs, and large enterprises. The solution can be deployed on-premises or in a public or private cloud.
AIOps is part of the Elastic Observability solution. The product ingests metrics, logs, and traces from applications hosted in a data center or a public cloud environment. All business and operational data is ingested using integrations, which support open standards and open-source tooling and use a common schema for metrics, logs, traces, and security events based on the Elastic Common Schema (ECS) and OpenTelemetry (OTel).
Elastic provides remediation capabilities to keep the deployment healthy, but there is no automatic remediation within Elastic Observability. However, auto-remediation can be done via integration with third-party tools.
Elastic’s dashboards, driven by Kibana, are customizable and provide search, drilldown, pivoting, and visualization capabilities.
After anomaly detection, Elastic creates baselines of normal behavior. These can be used to extrapolate future behavior. By default, that future is one day. Typically, the farther into the future the forecast, the lower the confidence levels become. Eventually, if the confidence levels are too low, the forecast stops. The ML engine is data-agnostic and can be used to support data that’s being ingested to find anomalies and outliers and produce forecasts based on trends.
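The behavior described above, where a forecast stops once confidence drops too far, can be sketched with a simple trend extrapolation whose uncertainty band widens with the horizon. This is an illustrative stand-in, not Elastic's forecasting model; the function name, the linear trend, the square-root widening rule, and the `max_band_width` cutoff are all assumptions.

```python
from statistics import mean, stdev


def forecast_until_uncertain(history, max_steps, max_band_width, z=1.96):
    """Extrapolate a linear trend and stop when the band gets too wide.

    Returns a list of (point, lower, upper) tuples, one per forecast step.
    Needs at least three history values.
    """
    n = len(history)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(history)
    # Ordinary least-squares fit of a straight line to the history.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, history)]
    sigma = stdev(residuals)

    forecast = []
    for h in range(1, max_steps + 1):
        point = intercept + slope * (n - 1 + h)
        # Assumed rule: uncertainty grows with the square root of the horizon.
        half_width = z * sigma * h ** 0.5
        if 2 * half_width > max_band_width:
            break  # confidence is too low to continue
        forecast.append((point, point - half_width, point + half_width))
    return forecast
```

Noisier history produces a larger residual spread, so the band reaches the cutoff sooner and the forecast horizon shrinks.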
As with a number of AIOps vendors, BI data can be integrated with Elastic; however, there is no built-in capability. Of course, Kibana can be used to visualize BI data after integration. BI integration is via APIs and may not be as simple as integrating a metric-type data source.
Public and private clouds are supported, and multicloud support is available using the open source tools that make up Elastic. With a subscription, the support is complete. The use of open source technologies may complicate implementations for enterprises not well versed in this technology and without a subscription license.
Service-level management is not a part of the Elastic solution. There is no support for defining SLAs or SLOs. There may be open-source add-ons for such functionality within the Elastic community.
Task and workflow management is handled via integration with third-party tools. The task management API is new, having recently reached GA. Integration with standard ITSM tools is available out of the box or from the Elastic community.
Value metrics can be defined and customized into dashboards. There are no native value metrics in Elastic Observability. Elastic does not provide metrics on how the solution is performing. Other metrics such as MTTx can be created using manual methods.
Elastic is strong on security integration, which allows operations and security teams to be unified on a single platform to monitor application and infrastructure performance. SIEM and endpoint security are built in, and the platform provides mitigation, detection, and response, with ML and behavior analytics available to detect and react to threats. Elastic has good AI/ML capabilities that center around correlating anomalies to downstream data and dependencies to assist with root cause analysis; however, autoML is not available.
There are two unsupervised ML model types: anomaly detection and outlier detection. Anomaly detection runs continuously and creates a probability model for identifying unusual events. Outlier detection, which does not run continuously, identifies unusual points in a dataset by analyzing a point’s proximity to similar data points and the density of the cluster of points around it. Additional manual tuning is possible. However, explainable AI and AI bias are not addressed by Elastic Observability.
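A proximity-based outlier score of the kind described above can be approximated with a k-nearest-neighbor distance. The sketch below is a generic illustration and not Elastic's implementation; the function name and the choice of mean distance as the score are assumptions.

```python
import math


def knn_outlier_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbors.

    A higher score means the point sits farther from its neighbors, that is,
    in a sparser region of the dataset. Assumes len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


# Four points in a tight cluster and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=3)
most_isolated = points[scores.index(max(scores))]
```

This brute-force version compares every pair of points, which is fine for a small batch but would need an index structure for large datasets.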
Licensing costs vary with the edition, from free up to the enterprise edition with host- and consumption-based licensing. The enterprise edition includes professional services support and is necessary for AI/ML usage. Tool displacement can be none at all or significant depending on enterprise needs. As an open-source solution, tool displacement is more of a strategy than a requirement. Enterprises already supporting open-source tools for monitoring and data gathering may find Elastic a good fit.
The Elastic Observability solution can be easy to use for enterprises with the technical ability to handle open-source software. The interface varies from low-code in some areas to programming in others. The Elastic community provides support along with professional services from Elastic.
Scaling is easy for the SaaS deployment—it is handled by Elastic. For self-managed implementations, support from Elastic may be necessary to meet BC/DR requirements for the enterprise. The open-source community also provides tips on scaling and BC/DR.
Strengths: Elastic has competent data ingestion using open source collectors and agents. OpenTelemetry is supported. Dashboards are customizable and can be shared with others. The SaaS solution scales easily. A security tool, Elastic Security for SIEM, provides the basis for cybersecurity.
Challenges: Elastic continues to struggle with identity. Its previous background was in log management and APM, and adding observability now makes a fit with Elastic difficult to determine. The lack of automated remediation and task/workflow management stands out, though integration with third-party tools is possible. The solution doesn’t support autoML, explainable AI, or AI bias management.
Evolven Configuration Risk Intelligence
Founded in 2007, Evolven’s patented technology uses ML algorithms to detect patterns and anomalies that could indicate potential problems.
Evolven does not meet the table stake for ML-based real-time monitoring. This would normally disqualify the company from this Radar; however, the company provides an enterprise solution that examines configuration state in a unique way that complements AIOps, making it worth considering as an additional tool with a high level of intelligence. The target market includes MSPs and large enterprises, including the public sector. Many Evolven customers are in the financial services industries; however, the product can be used in any market segment.
Evolven concentrates on configuration changes, collecting detailed configurations for every layer of the hybrid cloud, resolving them to individual parameters. Once configuration baselines are in place, Evolven tracks granular changes to these baselines executed either manually or automatically, planned or unauthorized.
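The baseline-and-change tracking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Evolven's method: the helper names `flatten` and `diff_against_baseline`, the dotted-path convention, and the sample configuration are all hypothetical. A nested configuration is resolved to individual parameters and then compared against the stored baseline.

```python
def flatten(config, prefix=""):
    """Resolve a nested configuration into individual dotted parameters."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def diff_against_baseline(baseline, current):
    """Return parameters added, removed, or changed relative to the baseline."""
    base, cur = flatten(baseline), flatten(current)
    return {
        "added": {k: cur[k] for k in cur.keys() - base.keys()},
        "removed": {k: base[k] for k in base.keys() - cur.keys()},
        "changed": {
            k: (base[k], cur[k])
            for k in base.keys() & cur.keys()
            if base[k] != cur[k]
        },
    }

# Hypothetical baseline and current state for one host.
baseline = {"jvm": {"heap_mb": 2048, "gc": "G1"}, "tls": {"min_version": "1.2"}}
current = {
    "jvm": {"heap_mb": 4096, "gc": "G1"},
    "tls": {"min_version": "1.2"},
    "debug": True,
}
drift = diff_against_baseline(baseline, current)
```

Here `drift` reports one changed parameter (`jvm.heap_mb`) and one added parameter (`debug`). Whether a given change was planned or unauthorized would have to come from correlating the diff with change records, which this sketch does not attempt.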
The solution consists of the Evolven Configuration Risk Intelligence Platform server and Evolven agents. It can be deployed as a virtual appliance on Windows or Linux, a public cloud image (essentially the same as the virtual appliance), self-managed software deployed on-premises, and as a SaaS offering.
Metric or performance data are collected via integration with other monitoring, observability, or AIOps tools. Evolven consumes any form of data via its agents or using an API. The data recognized out of the box includes code changes, configuration changes, change records, CMDB CI information, asset data, incident tickets, monitoring alerts, KPIs, CI/CD metadata, automated deployment metadata, and essential system state information. Evolven ingests from an above-average number of data sources.
Evolven generates artifacts that can be plugged in to the deployment and remediation sequences of other tools. Though this may be adequate in many cases, auto-remediation is not a strong point for Evolven. It complements DevOps efforts well by providing end-to-end observability for the CI/CD pipeline.
Evolven has strong customizable dashboards with excellent navigation and drill-down capabilities. New customized dashboards can be created on the fly and shared with others. Comments can be added and actions taken using data on a dashboard.
The Evolven platform applies AI-based analytics to identify high-risk changes and misconfigurations that might adversely affect the stability, compliance, and security of the environment. Evolven assesses various aspects of configurations and changes to estimate the possibility of a negative impact, including how a change was executed, the timing of the changes, the history of similar changes, the anomalous nature of the changes, benchmarks of the configuration across the environment, and much more. However, the risk estimation does not include a timeline of any future impact. The prediction capability of Evolven is below average for this Radar in keeping with its state-based model; on the other hand, it complements other AIOps tools for root cause analysis.
Evolven provides a range of capabilities that allow it to extract BI data, including KPIs from other tools. However, it does not have prepackaged integration with BI tools.
Evolven supports multicloud environments, including public clouds, private clouds, and hybrid clouds. It provides an end-to-end view of the configuration of the environment, unified via a single pane of glass.
Evolven does not monitor and manage performance and availability but rather analyzes the risks coming from misconfigurations and changes. The solution helps to achieve and improve SLAs, but it does not monitor them. To make up for this while staying in its lane of state-based analysis, the Evolven platform ingests from real-time observability tools to provide an effective metric on SLAs.
Evolven’s platform facilitates collaboration and streamlines workflows by providing task assignments and tracking both within the platform and with third-party ticketing systems. Workflow integration with collaboration tools is provided out-of-the-box. New workflows can be configured without coding.
The solution enables customers to define their own metrics via a no-code environment. These are typically focused on compliance, configuration drift control, unauthorized changes, and change process compliance.
Evolven does not offer AutoML or explainable AI. The core of Evolven’s platform is a change-centric vertical AI that is revealed in the product in various places. This vertical AI is pre-trained and uses some unattended ML algorithms applied to the specific problem of change and configuration management. Evolven integrates with existing cybersecurity tools, and all information needed to create a risk score is available along with the logic and decision trees.
Evolven provides manual features that impact the AI’s decision-making regarding change risks, and it therefore controls bias within the system. For example, marking a change as suspicious, high risk, or critical impacts the AI.
Licensing uses a perpetual/per operating system instance (OSI) model or a subscription OSI model. The Evolven License Edition is licensed for one agent per OSI, which can be used in any production or non-production environment. The total number of OSIs is most often calculated as the total number of servers, plus total number of VMs, plus network, security, and storage devices. Network, security, and storage devices are calculated in a 5 to 1 ratio, where every five devices equal one OSI. Tool displacement is unlikely; Evolven is a unique solution.
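The OSI arithmetic above can be written out as a short calculation. This is a sketch only: the function name `licensed_osi_count` is hypothetical, and rounding a partial group of five devices up to a full OSI is an assumption, since the text does not say how remainders are handled.

```python
import math

def licensed_osi_count(servers, vms, devices, devices_per_osi=5):
    """Estimate the OSI count: servers + VMs + (devices at a 5:1 ratio).

    `devices` covers network, security, and storage devices.
    ASSUMPTION: a partial group of devices is rounded up to one OSI;
    the source does not specify the rounding rule.
    """
    return servers + vms + math.ceil(devices / devices_per_osi)

# Hypothetical estate: 40 servers, 200 VMs, 35 network/security/storage devices.
total_osi = licensed_osi_count(servers=40, vms=200, devices=35)
```

With these example figures the 35 devices count as 7 OSIs, giving 247 OSIs in total. Actual license counts should be confirmed with the vendor.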
Strengths: Evolven has very strong change detection and reporting capabilities. The AI provides excellent insight into parameter and configuration changes and differs from other vendors in this Radar by being focused on system state. Evolven successfully finds “ghost” changes to the environment residing as parameters and within configuration files.
Challenges: The Evolven solution’s lack of direct real-time monitoring has been noted; however, its ability to see changes in the enterprise state merits its inclusion in this Radar. There are no prediction capabilities, but its AI engine is capable of identifying risky changes before they become a performance problem or impact availability.
HCLSoftware DRYiCE
HCLSoftware is a global technology company that was founded in 1976 and has made a number of strategic acquisitions in recent years, including the purchase of British IT services firm Axon in 2008 and software company Geometric in 2016.
DRYiCE supports MSPs, SMBs, and large enterprises. Deployment is via a managed host offering on Google Cloud Platform (GCP) and an on-premises deployment.
DRYiCE connects to monitoring systems and ITSM tools through a REST API or custom integrations. There are a number of out-of-the-box integrations that support other available tools. DRYiCE does not support OpenTelemetry directly.
DRYiCE’s auto-remediation feature supports over 3,000 runbooks. The internal workload automation module can automate workflows, and it comes with out-of-the-box support for common orchestration tools. HCLSoftware supports a REST API. Its feature set is above average.
Dashboards can be customized using filters and custom widgets in a low-code environment. All widgets support drill-down options with increasing granularity at lower levels.
Predictions related to a business’s key performance requirements can be made in time spans of minutes, hours, or any other increments, based on existing data. These predictions can be used to find and correct events and incidents before they occur, to forecast workload demands, predict SLA breaches, and drive notifications. This is above average for AIOps solutions. Moreover, DRYiCE can provide hybrid cloud lifecycle management with self-service provisioning. The FinOps module handles predictive analytics, forecasting, and cost optimization recommendations with pre-configured dashboards and unified reporting.
DRYiCE can show KPI health status, which can be mapped with SLAs, SLOs, and SLIs. Unified reporting modules are used for standard reporting of these metrics.
The intelligent ticket assignment capability ensures that the tickets are assigned to the right support team. Teams can leverage an AI cognitive virtual assistant called Lucy that uses NLP to enhance user experiences. The workflow orchestration and chat workflow configuration can be accomplished using low-code features. Business workflows are set up using the drag-and-drop workflow builder. BI information is ingested via APIs. All the audit logs are stored and can later be used as reporting metrics. There is a unified reporting module that enterprises can use to show AIOps performance. Custom reports that let users set up their own metrics can be defined.
AutoML is not part of the product, nor is it found in the current HCLSoftware roadmap. Some aspects of cybersecurity are covered by the DRYiCE patching and compliance module. Patches are correlated with vulnerabilities for detection and prioritization. Additional threat detection is on the roadmap. Explainable AI has been added to the cognitive assistant module and may be available in the next 18 to 24 months. AI bias management is not on the current roadmap. The AI team is evaluating its applicability for a future release.
DRYiCE has flexible consumption models that allow customers to pay as they use or opt instead for full-stack AIOps with discounted pricing, which comes with a term license.
The full-stack AIOps offering is charged per instance per year. There are two plans: the standard plan is for medium-size customers with 2,501 to 5,000 endpoints, while the enterprise option is best suited for those with 5,001 to 10,000 endpoints. Professional services are required for deployment.
Tool displacement is not necessary; customers can choose the modules required. The existing tools can stay, and DRYiCE can integrate with them. Scaling the solution requires professional services, and BC/DR plans are developed in conjunction with professional services, with updates performed in a user acceptance/testing environment. The agents installed for remediation are updated manually.
Strengths: DRYiCE has strong predictive capabilities, and its auto-remediation abilities are above average. The ability to use low-code/no-code features to customize the dashboard makes it easy to create a unified view of the IT environment.
Challenges: There is no direct OpenTelemetry support. Professional services are required for deployment. The HCLSoftware AI team should evaluate emerging technologies that will be available in the near future.
IBM Cloud Pak for Watson AIOps
IBM developed its AIOps solution to help organizations improve the performance and reliability of their IT systems. IBM AIOps was first introduced in 2018 as a combination of IBM’s existing IT operations management tools and ML algorithms that can analyze vast amounts of data in real time.
IBM Watson AIOps caters to a wide range of businesses, from SMBs to large enterprises. Deployment can be on a public or private cloud, hybrid cloud, or on-premises.
IBM Watson AIOps can ingest data from various sources, including logs, metrics, and events, and via APIs and pre-built integrations, but it does not natively support OpenTelemetry data ingestion.
Ease of use is above average. Watson AIOps includes built-in support for runbook auto-remediation and supports auto-remediation through integration with external tools, systems, or platforms. It does not include a native dashboarding feature for creating custom dashboards. It is designed to integrate with other monitoring, observability, and analytics tools that enable custom dashboard creation and sharing, such as Grafana and Kibana.
Watson AIOps supports predictive analysis, using historical data to forecast trends and patterns. It can also forecast incidents and perform root cause predictions.
IBM Watson AIOps does not directly support BI tools, but customers can connect Watson AIOps by using APIs or custom integrations to extract relevant data and generate reports or visualizations in the BI tool of choice.
IBM Cloud Pak for Watson AIOps supports multicloud environments via integrations available within the base features of the product. Depending on licensing, this software may be included with Watson AIOps.
While IBM Watson AIOps does not explicitly manage SLAs or SLOs, it provides functionality and insights that enable IT operations teams to monitor, maintain, and improve service levels, helping organizations meet their service level commitments.
IBM Watson AIOps provides native task and workflow management features outside of runbook support. Automations can be triggered by alert properties, business criticality, and anomaly insights. It integrates with various external tools for incident management, additional remediation workflows, and collaboration and communication solutions.
The platform employs advanced ML algorithms and techniques to analyze data, detect anomalies, predict incidents, and identify root causes. These algorithms are not exposed. Some algorithms can be automatic (AutoML), while others require configuration. IBM has a strong history in AI. More autoML features will exist in the future.
Watson AIOps is not explicitly designed as a cybersecurity solution, but it is able to enhance an organization’s cybersecurity posture indirectly by providing insights into the IT environment, detecting anomalies, and predicting potential issues. IBM sells cybersecurity solutions that can be integrated with Watson AIOps.
Although Watson AIOps does not specifically emphasize explainable AI, it strives to make the insights and recommendations it generates as transparent and understandable as possible for users. For example, when Watson AIOps identifies anomalies or potential incidents, it provides information about the factors that contributed to the detection, such as specific metrics, logs, or events.
IBM is committed to developing AI systems that are transparent, fair, and unbiased, and this commitment extends to Watson AIOps. However, bias management is not available.
Tool displacement is not necessary; however, IBM has solutions for most tools in any environment from mainframe to cloud-based application environments. Professional services are not required, but organizations may benefit from using them when implementing IBM Watson AIOps in complicated enterprise environments. The solution collects and measures license usage via the virtual processor core (VPC) metric at the cluster level.
The solution can be scaled and customized to meet the specific needs and requirements of different organizations, which, due to the containerized deployment model, takes minimal effort. Managing the environment is easier than average, as IBM provides extensive documentation and support.
Strengths: Watson AIOps has good predictive analytics. The multicloud support covers any cloud environment. The scaling and management capabilities of the solution are good. The strength of the Watson AI has been demonstrated in other areas and should carry forward in AIOps.
Challenges: There is no support for OpenTelemetry, but the solution can ingest data from other sources via API integration. Watson AIOps relies on API integrations that will likely need professional services in complicated environments.
Interlink AIOps
Founded in the UK in 1996, Interlink AIOps has gained recognition as a promising player in the AIOps market.
Interlink’s AIOps offering is an AI-driven solution that streamlines IT operations by automating anomaly detection, root cause analysis, event correlation, and predictive analytics. It is designed to utilize ML algorithms to continuously improve its performance.
The target market for Interlink AIOps is large enterprises and service providers. It is not well suited for SMBs. Deployment can be either in the cloud or on-premises. A hybrid solution using both implementations can be visualized via a single screen.
A number of out-of-the-box integrations are provided by Interlink leveraging an API, SNMP, email, database webhooks, and REST interfaces. OpenTelemetry is supported. Interlink has an automation engine and integrates with other automation and orchestration tools. Dashboards are customizable and can be shared. Preconfigured dashboards come with the product as a starting point for customization.
The predictive capabilities of the solution are based on the amount of historical data available and cover metrics, events, and incidents. BI data can be ingested using the API and merged with monitoring and observational data to generate a business view of the enterprise. Multicloud support is available by deploying agents in each cloud environment.
Service-level management is built into the platform, and there is a low-code workflow management system. Value metrics are handled using pre-configured dashboards that can be customized using service-level management configurations or defined as needed.
Interlink AIOps does not have any of the emerging technologies discussed in this Radar.
Premium licensing includes integration services. Other licensing models may be available depending on discussions with Interlink.
The solution is domain agnostic and does not require displacement of existing tools. Interlink does, however, offer enterprise monitoring capabilities and an incident management solution that includes a mobile app.
Professional services are offered but not required for deployment, even though AIOps implementations are a large lift in any environment.
Ease-of-use is on par with other AIOps offerings. Low-code design methods are used throughout the product, and the look and feel are consistent within the solution.
The cloud implementation scales easily and is maintained by Interlink. On-premises deployments require planning for scalability and BC/DR. A need for professional services is possible, but not required for savvy enterprises.
Strengths: Interlink has a good automated remediation module. The service-level management tools work well. Interlink is ranked above average for task and workflow management. Data ingestion includes OpenTelemetry.
Challenges: The lack of BI integrations means enterprises will need to develop the integrations themselves. The state of the current AI/ML in Interlink precludes discussion of emerging technologies.
ITRS Group Geneos
ITRS Group is a technology company founded in 1997 in London, England. Its flagship product is the Geneos platform, which provides real-time monitoring and analytics for mission-critical IT systems.
ITRS Geneos unites the company’s products into a single platform. Included are Obcerv, which centralizes critical monitoring data (such as metrics, states, and logs) from infrastructure, applications, and networks into a single repository, and Active Console, which allows customers to connect to and view the contents of Geneos Gateways, which sit between NetProbes and Active Console. A Netprobe is a lightweight monitoring agent deployed on every managed node.
ITRS supports SMBs, large enterprises, and MSPs. ITRS also markets and creates vertical integrations for finance, healthcare, and other sectors. Deployment can be via SaaS, on-premises, or a hybrid combination.
Geneos can ingest data via OpenTelemetry or the hundreds of out-of-the-box integrations that ITRS supplies. NetProbes are used on managed devices to gather data and control the system. Though it can ingest data from various sources, the AIOps solution is not totally domain agnostic, as many parts of it must come from ITRS. Geneos is capable of auto-remediation via a Geneos gateway and REST APIs. Programming skills are needed for this functionality, though the interface is low-code. Dashboard customization is enabled by a no-code environment, and widgets added to a dashboard can be customized via low-code configuration.
Predictive analytics capabilities are minimal. BI is handled via integrations with BI tools using the ITRS Opsview product. Multicloud support, using Geneos gateways, is available for public and private clouds. ITRS provides service-level management via its report generator. Each SLA must be created manually via the low-code report generator. Knowledge of devices and services is needed to create an SLA report. Task and workflow management within the Geneos ecosystem is good, but the bi-directional integration of ITSM workflows is not straightforward. Value metrics, such as MTTx, can be defined and shown in dashboards where needed.
ITRS Geneos offers none of the emerging technologies discussed in this Radar.
ITRS requires a Geneos environment, a consideration that impacts the solution’s cost and value, as it’s an extra requirement for users. Tool displacement is not required because data from existing tools can be ingested; however, the need for additional ITRS agents and collectors requires redundant software on many systems, negating that advantage. Ease of use is average, but admin-level dashboards have to be created. Scaling the solution when it is managed by ITRS is simple, but management of the various Geneos gateways, NetProbes, and hubs may make scaling difficult; support for Puppet and Helm is present. The self-managed version requires careful planning before increasing the size of the solution. Management of Geneos, especially for self-managed deployments, can be awkward. Many of the components can’t be updated automatically, requiring manual intervention.
Strengths: ITRS has a good monitoring infrastructure to collect data. The dashboards are customizable via a low-code environment. Multicloud support is good.
Challenges: A complete solution includes additional ITRS tools, which make tool displacement a concern. Auto-remediation is not available, although the API can be used to remediate some conditions, which may require programming skills. Predictive analytics capabilities are lacking in the current iteration of ITRS Geneos. BI information must be integrated manually. There is no support for service-level management.
LogicMonitor Enterprise
LogicMonitor is a technology company founded in 2007 in Santa Barbara, California. The company provides SaaS-based performance monitoring and analytics solutions for IT infrastructure, cloud services, and applications. In 2020, it made a number of acquisitions, including the Swedish AI and ML startup Unomaly.
AIOps from LogicMonitor is a cloud-based IT operations platform that leverages AI/ML to provide highly scalable agentless monitoring and predictive analytics for cloud, hybrid, and on-premises environments. Its customizable dashboards, alerts, and reports enable tailored insights, and it has automation capabilities.
LogicMonitor can be used by SMBs, large enterprises, and MSPs. LogicMonitor is a SaaS application that supports a broad range of vendors in hybrid infrastructures.
Metric ingestion is done with LogicMonitor Collectors. OpenTelemetry collectors are also supported. Enterprises can integrate LogicMonitor with third-party automation solutions to provide a seamless experience between the two platforms for monitoring new and updated infrastructure. Integrating LogicMonitor with an automation tool allows provisioning, management, and remediation based on LogicMonitor alerts. Dashboards can be created, configured, and shared with others. The interface is low-code, but users must have some understanding of the underlying metrics.
LogicMonitor’s data forecasting lets customers predict future trends for their monitored infrastructure, using past performance as the basis. Predictive analytics can extend to one year, depending on availability of data. Customers can set alerts on forecasted data. Out-of-the-box BI integration is not supported. Service levels can be defined for any device or groups of devices that may be defined as a service. Task and workflow management are accomplished via integrations with external tools, allowing the correlation of alerts with the relevant metrics, logs, traces, and topology. Bi-directional data can be passed to support alerts and events triggered by LogicMonitor. Value metrics can be calculated using the SLA report and defining which values and calculation methods to use.
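Forecasting from past performance and alerting on the forecast, as described above, can be illustrated with a simple least-squares trend line. This is a sketch under stated assumptions and not LogicMonitor's algorithm, which is not described in this report: the function names, the linear model, the evenly spaced daily samples, and the threshold are all hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def first_forecast_breach(values, threshold, horizon):
    """Return how many steps ahead the forecast first reaches the threshold.

    Returns None if the threshold is not reached within `horizon` steps.
    """
    slope, intercept = fit_linear_trend(values)
    last_index = len(values) - 1
    for step in range(1, horizon + 1):
        if slope * (last_index + step) + intercept >= threshold:
            return step
    return None

# Hypothetical daily disk usage (percent) with a steady upward trend.
disk_used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]
days_until_alert = first_forecast_breach(disk_used_pct, threshold=80.0, horizon=30)
```

An alert on forecasted data would fire when `days_until_alert` is not `None` and falls inside the lead time the operations team needs. Real forecasting would also need to account for seasonality and noise, which a straight line ignores.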
LogicMonitor does not offer any of the emerging technologies identified in this Radar report.
LogicMonitor does not require tool displacement; however, the greatest value will be returned by using LogicMonitor tooling for as much work as possible. Professional services are not required. Managers of larger and more complex deployments should investigate the need for services. Only the LogicMonitor Enterprise deployment supports AIOps and anomaly detection. It also includes the LogicMonitor APM solution and device configuration management. Displacement or redundant usage of existing tools may take place.
Strengths: LogicMonitor’s data collection is good and includes data sourced from OpenTelemetry, and its dashboards are customizable (some low-code effort is required at times). LogicMonitor provides predictive analytics out to one year, and there is multicloud support with a single view of all cloud metrics and events.
Challenges: There is a lack of AI transparency and no support for emerging technologies. Service level dashboards can be time consuming to create.
Logz.io
Logz.io was founded in 2014 in Tel Aviv, Israel. The company provides a cloud-based platform for log analysis and monitoring, and uses ML and analytics technologies.
Logz.io is an open source-based platform powered by OpenSearch, OpenTelemetry, Prometheus, Jaeger, and the company’s own Cognitive Insights AI solution. The target audience is any enterprise, regardless of size.
Deployment is via SaaS. Logz.io maintains clusters worldwide so it can host accounts in the region that is closest to the enterprise. However, there are no clusters in South America or Africa.
Data ingestion is via the open source collectors in OpenTelemetry, Prometheus, Jaeger, or the large number of integrations available from Logz.io. Integrations for logs, metrics, tracing data, cloud SIEM, and synthetic monitoring are available. Additionally, the open source community publishes integrations compatible with Logz.io.
Integration with BI tools can be achieved only by using the Logz.io API. However, integrations can be built with Terraform, a popular open source infrastructure as code (IaC) tool that does away with most manual configuration processes. The open source collectors supported by Logz.io allow public and private cloud support. Service-level management is not a part of Logz.io. Out-of-the-box task and workflow management are limited to sending data to ServiceNow or Opsgenie; however, other endpoint integrations are possible.
Dashboards are customizable, and a number of preconfigured dashboards are available out of the box. Dashboards with value metrics can be created but are not available upon deployment. Predictive analytics capabilities are minimal and limited to some infrastructure metrics. There is no automated remediation capability within the platform.
Logz.io does not offer the emerging technologies discussed in this report (Table 4).
Evaluation of cost and value is always difficult for open source products. The licensing costs are competitive but broken up into log management, infrastructure monitoring, and distributed tracing. The licensing of cloud SIEM is similar to log management and measured in GB ingested per day. Infrastructure monitoring is per 1,000 time series metrics per day with 18-month retention. Lastly, distributed tracing is per million spans, per day with 10-day retention.
Enterprises should be well versed in the underlying open source offerings. Displacement of existing tools is likely, and support will come from the community rather than professional services. Ease of use is average and is based on enterprise comfort with open source products. The solution scales well because it is a SaaS offering available in multiple regions. Manageability is also good for the same reason.
Strengths: Logz.io has good cost and value, depending on the open-source comfort level of the enterprise. There is good multicloud support.
Challenges: There is no automated remediation. The predictive capabilities are limited to a few metrics, and the solution does not handle seasonality. BI integration is via API with no direct support. The task and workflow management is a one-way solution to ServiceNow or Opsgenie. A confusing interface makes the creation of value metrics difficult. The open source basis of the solution may be a challenge for some enterprises.
meshIQ AIOps
meshIQ, formerly Nastel, is a technology company founded in 1994 in Melville, New York. Its solution is an observability platform for messaging, event processing, and streaming across hybrid cloud (MESH).
The target market is financial institutions and large enterprises needing insight into business transactions within message queues. Deployment is via SaaS or a self-managed installation on-premises or in a private or public cloud.
meshIQ can track transactions across multiple middleware and infrastructure environments, stitching related messages together to visualize the business flow and to identify and alert on anomalies, latency, and lost and diverted messages using real-time data analytics. Transactions can include applications such as IBM MQ, Apache Kafka, Solace, RabbitMQ, ActiveMQ, and many more. meshIQ can ingest data from OpenTelemetry. It uses agents to collect data.
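Stitching related messages into a transaction and flagging latency or lost messages, as described above, can be sketched as a grouping step followed by a per-transaction check. This is an illustration only and not meshIQ's implementation: the event fields (`correlation_id`, `stage`, `ts_ms`), the stage names, and the function name `stitch_transactions` are assumptions.

```python
from collections import defaultdict

def stitch_transactions(events, latency_limit_ms):
    """Group events by correlation ID and classify each transaction.

    A transaction with only one of its two stages is "incomplete"
    (for example, a message that was produced but never consumed);
    one that exceeds the latency limit is "slow"; otherwise it is "ok".
    """
    stages_by_id = defaultdict(dict)
    for event in events:
        stages_by_id[event["correlation_id"]][event["stage"]] = event["ts_ms"]

    report = {}
    for cid, stages in stages_by_id.items():
        produced = stages.get("produced")
        consumed = stages.get("consumed")
        if produced is None or consumed is None:
            report[cid] = "incomplete"
        elif consumed - produced > latency_limit_ms:
            report[cid] = "slow"
        else:
            report[cid] = "ok"
    return report

# Hypothetical events collected from two middleware hops.
events = [
    {"correlation_id": "a", "stage": "produced", "ts_ms": 0},
    {"correlation_id": "a", "stage": "consumed", "ts_ms": 40},
    {"correlation_id": "b", "stage": "produced", "ts_ms": 10},
    {"correlation_id": "b", "stage": "consumed", "ts_ms": 900},
    {"correlation_id": "c", "stage": "produced", "ts_ms": 20},
]
report = stitch_transactions(events, latency_limit_ms=500)
```

In a streaming setting, an "incomplete" transaction would only be treated as lost after a timeout, since the consuming event may simply not have arrived yet.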
The meshIQ solution can automate corrective and preventive actions via scripting and APIs. Integrations with ticketing systems such as ServiceNow, event management systems, and collaboration tools enable automation wherever it is relevant.
meshIQ provides out-of-the-box, configurable dashboards, views, and other visualizations and reporting to meet different business needs. Viewlets, which can have their own URLs, provide a summary of the number of objects (such as events, activities, or snapshots) and give operations teams a way to share critical information with other IT groups and with business users.
Multicloud support is part of the basic architecture, and message queues that move among private clouds, public clouds, and on-premises infrastructures are supported. Although the solution can report on a defined service level, it provides no service-level management. Task and workflow management for messages within a queue is good. Connections to ITSM systems allow task management, which is typical for AIOps solutions. Value metrics and MTTx can be defined but are not included in the out-of-the-box solution.
The cost and value equation for meshIQ can be complicated by the focus on message queues. If understanding the internals of message transactions is important to the enterprise, meshIQ is an easy choice. No tool displacement is necessary, and professional services are available if needed.
The solution is easy to use, and a low-code environment is available for some aspects of it.
Strengths: The strength of meshIQ is its ability to not just monitor message queues from an infrastructure perspective but to manage and report on the internals of a messaging system. meshIQ has outstanding capabilities in automation, dashboards and reports, and data consumption. In addition, the open nature of the tool makes it a good fit for open-source shops. The ability to customize functions also provides more flexibility. meshIQ provides peerless awareness of middleware message flows.
Challenges: The solution’s focus on message flows and application flows may be too limiting. AI/ML tools have improved but are not on par with other AIOps solutions. Predictive analytics are elementary, but the roadmap looks promising. There is no service-level management.
Moogsoft AIOps
Moogsoft is a technology company founded in 2011 in San Francisco, California. The company has expanded its AIOps offerings to include a range of additional features and capabilities, such as event correlation, root cause analysis, and collaboration.
Moogsoft is a single domain-agnostic solution with two deployment methods: SaaS and self-managed. The SaaS platform is hosted by Moogsoft on AWS. The self-managed offering can be on-premises or hosted by Moogsoft on AWS. However, there are features in the self-managed version that can’t be implemented in the SaaS version and SaaS features that can’t be implemented on the self-managed platform. Though the Moogsoft roadmap for parity is in progress, Moogsoft encourages users to initially adopt or move to the SaaS implementation.
Moogsoft can ingest any data from anywhere and natively supports events, metrics, and logs. In the SaaS version, ingestion is either by a predefined integration or in a no-code “configure your own integration” environment. The self-managed platform requires coding to create a new integration. OpenTelemetry is supported in both implementations.
Moogsoft can automatically trigger remediations at the alert and incident level, and it provides resources for remediation: its own workflow engine, webhook capabilities, and an API that enable integration with any third-party or customer-provided remediation solutions.
Moogsoft’s customizable dashboards can be shared with users and groups, and the solution comes with a sample set of dashboards in the SaaS implementation. Grafana is an unsupported add-on for self-managed enterprises.
The AI/ML is designed for real-time applications and does not perform linear-regression style prediction.
Moogsoft provides multiple techniques to integrate with BI tools. A BI tool can export to Snowflake, where Moogsoft can extract and build customizable insights using tools such as Looker, Power BI, Grafana, and others. For the self-managed implementation, Moogsoft provides sample dashboards for Grafana that use the Moogsoft APIs.
Moogsoft consumes events and metrics from AWS, Azure, and GCP. Private clouds can also be supported using the ingestion APIs to provide a single view. The deployment effort to ingest data from multiple cloud environments takes approximately 30 minutes per cloud provider. In self-managed environments, the effort will take longer. Most AIOps solutions have similar capabilities.
Moogsoft captures and stores incident state change times, which are published via API. These can be combined with additional criteria to provide customer-defined SLO and SLA data. Moogsoft provides no direct service-level management.
Workflow is considered an area of strength for Moogsoft in the SaaS version (less so in the self-managed implementation). The no-code workflow engine can drive a broad range of workflows throughout the incident management lifecycle, using BI data to prioritize and escalate them. This is better than the average for AIOps solutions. Additionally, the Moogsoft situation rooms are a key differentiator, providing similar situation detection and solution suggestions, resolving steps based on root cause analysis, and doing enhanced timeline management.
Moogsoft captures and stores incident state change times, which are published via API for use by other tools. These can be used to compute MTTx metrics. MTBCF can be calculated if criticality is defined and flagged as part of the incident.
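As a rough illustration of how MTTx metrics can be derived from stored state change times, the sketch below computes mean time to acknowledge and mean time to resolve. The record layout, the state names ("open", "acknowledged", "resolved"), and the function name are assumptions made for this example and do not reflect the Moogsoft API or schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical state-change records: (incident_id, state, ISO timestamp).
# Field layout and state names are illustrative, not a vendor schema.
STATE_CHANGES = [
    ("INC-1", "open", "2023-05-01T10:00:00"),
    ("INC-1", "acknowledged", "2023-05-01T10:05:00"),
    ("INC-1", "resolved", "2023-05-01T10:45:00"),
    ("INC-2", "open", "2023-05-01T12:00:00"),
    ("INC-2", "acknowledged", "2023-05-01T12:15:00"),
    ("INC-2", "resolved", "2023-05-01T13:15:00"),
]

def mean_minutes_between(changes, start_state, end_state):
    """Mean minutes from the first start_state to the first end_state per incident."""
    first_seen = {}
    for incident_id, state, stamp in changes:
        key = (incident_id, state)
        ts = datetime.fromisoformat(stamp)
        # Keep only the earliest time each incident entered each state.
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts
    durations = []
    for incident_id in {incident_id for incident_id, _, _ in changes}:
        start = first_seen.get((incident_id, start_state))
        end = first_seen.get((incident_id, end_state))
        # Skip incidents that have not reached both states yet.
        if start is not None and end is not None:
            durations.append((end - start).total_seconds() / 60)
    return mean(durations) if durations else None

mtta = mean_minutes_between(STATE_CHANGES, "open", "acknowledged")
mttr = mean_minutes_between(STATE_CHANGES, "open", "resolved")
```

A metric such as MTBCF would need an additional criticality flag on each incident, which this sketch does not model.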
Moogsoft does make use of AutoML within the product, fully in some areas and only partially in others. Entropy, Vertex Entropy, and Tempus are fully automated unsupervised algorithms. Probable root cause (PRC) is partially automated. It must be trained by an operator, but the operator doesn’t need to have any knowledge of the underlying technology. The user flags alerts as either root cause or symptoms.
All current AI capabilities are explainable and documented. In the last year, Moogsoft has added additional user capabilities that make it easier to understand which instance of an algorithm is responsible for a decision, both for clustering algorithms and metric anomaly detection.
Moogsoft provides a single pane of glass for various cybersecurity technologies. It currently integrates with and supports SonarQube, Rapid7, Deepfactor, Lacework, GuardDuty, SecurityScorecard, and UpGuard. Moogsoft’s excellent performance and correlation capabilities (especially the NLP-based similarity matching) can be used for SIEM-like use cases.
Strengths: Moogsoft provides excellent AI capabilities, with good use of AutoML. The product is well placed to take advantage of forthcoming innovations in AIOps (such as LLMs and generative AI). The solution can ingest any data from any source. Auto-remediation is built into the product along with task and workflow management. The Moogsoft situation rooms provide an excellent method for managing groups, triaging events, and working through root cause analysis.
Challenges: Moogsoft has no predictive capabilities. The parity between SaaS and self-managed deployment is not total, and 100% parity is not possible due to technical differences in the deployments. For new SaaS customers, this is not an issue. Self-managed users will have to manage the transition to SaaS to ensure results are consistent.
Netreo AIOps
Netreo is a technology company founded in 2019 in San Francisco, California. Its AIOps platform uses advanced AI and ML technologies to help organizations manage and optimize their IT operations. Netreo is a rising player in the AIOps market.
Netreo’s AIOps: Autopilot engine operates continually in the background of the company’s ITIM solution, scanning, analyzing, and reporting on how users can optimize their Netreo instance. Netreo’s automation architecture leverages APIs, CMDB systems, and cloud or SD-WAN providers to learn new devices.
Netreo targets all markets, including service providers, SMBs, and large enterprises. It can be deployed as a self-managed solution on-premises, as a virtual appliance, or in a public cloud using the same virtual code. A SaaS offering that’s managed by Netreo is available.
Netreo uses data collectors to ingest data. The ingestion of metrics, events, logs, and traces (MELT) data from OpenTelemetry is currently under discussion.
Netreo automatically fixes problems or provides engineers with suggested actions for remediation. Recommended solutions are based on ML algorithms applied against historical data from each specific Netreo deployment. This is an above average capability in the marketplace.
Netreo delivers a consolidated dashboard displaying the most relevant—and customizable—views of an entire IT environment, either on one page or in “slide view” format for dynamically scrolling through content. Netreo customers can create customized dashboard views with a simple drag-and-drop interface and can assign different views as the home page for any user. An API is available to show additional data.
The unified dashboard allows users to view multicloud and on-premises applications, including services, load balancers, storage, VMs, and infrastructure resources. Service-level management is not available within Netreo AIOps. Task and workflow management can be performed only by using the API to enable integration with other tools. The solution monitors alerts on standard KPIs for SLAs, SLOs, and SLIs, such as availability, response time, throughput, errors, and utilization. These are available via reports, dashboards, alerts, and API. A performance metric data API is available for extracting this data to BI tools, with further integrations planned.
Netreo has no predictive analysis capabilities, though this is a feature more common than not in this market.
Netreo’s core solution licensing is based on a per-device, per-month model. Add-on pricing is based on either tiers related to the number of devices or a fixed-fee model. Professional services are not mandatory for deployment. The deployment is agentless, which simplifies many startup tasks but may increase the load on the security staff to allow access to systems for metric gathering and auto-remediation. The solution can be deployed as a domain-agnostic tool. As a SaaS or cloud-based solution, Netreo scales easily and provides good manageability.
Strengths: Netreo’s auto-remediation is AI-powered, and it is a standout feature. Netreo is a domain-agnostic AIOps solution that scales well and has good manageability and maintainability.
Challenges: There are no predictive capabilities in the solution. The identified emerging technologies are not on the Netreo roadmap. The agentless system may complicate security efforts.
New Relic
New Relic is a technology company founded in 2008 in San Francisco, California. The company provides a cloud-based platform for APM and observability using analytics and AI/ML technologies.
New Relic provides an all-in-one platform solution with more than thirty named features for AIOps, infrastructure monitoring, and application performance management for cloud-based, on-premises, and hybrid environments. It is a pure SaaS solution that targets MSPs, SMBs, and large enterprises.
New Relic can handle a wide range of data sources, including logs, metrics, traces, and events, handling both structured and unstructured data within the New Relic Database, which supports its own query language. In addition, New Relic has built-in support for industry standards such as OpenTelemetry and Prometheus. Out-of-the-box integrations are extensive, and New Relic also supports network monitoring via SNMP, network flows, and network syslog.
New Relic allows users to automate remediation tasks based on specific conditions or triggers. New Relic provides automated remediation via integration with tools such as Ansible or Rundeck, which can be used to automatically execute runbooks or run remediation tactics.
New Relic dashboards have a navigation paradigm that allows for easy drill-down. All dashboards are customizable. They can be saved and shared with others.
New Relic offers a range of ML algorithms to predict potential issues and failures before they occur. The prediction horizon depends on a number of factors, including the amount of historical data available.
New Relic has integrations with all hyperscalers’ cloud monitoring tools. Its agents can be installed on private cloud infrastructure, such as OpenStack or VMware, as well as in on-premises data centers. Users can switch between different cloud providers or services and view performance metrics in a single dashboard.
Users can automatically set SLI baselines or configure custom SLIs and verify performance against SLOs. Alerts are based on error budgets, burn rates, and SLO compliance thresholds, and these metrics can be tracked over time.
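To make the terms error budget and burn rate concrete, here is a minimal sketch of the usual arithmetic behind them. It is not New Relic's implementation. The function names and the event-count inputs are assumptions for illustration.

```python
def error_budget(slo_target):
    """Allowed fraction of bad events for an SLO target such as 0.999."""
    return 1.0 - slo_target

def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values mean it will run out early.
    """
    if total_events == 0:
        return 0.0
    observed = bad_events / total_events
    return observed / error_budget(slo_target)

def budget_remaining(bad_events, total_events, slo_target):
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo_target) * total_events
    if allowed_bad == 0:
        return 0.0
    return 1.0 - bad_events / allowed_bad

# Hypothetical window: 10,000 requests, 20 failures, 99.9% SLO target.
rate = burn_rate(20, 10_000, 0.999)          # roughly 2: burning twice as fast as allowed
left = budget_remaining(5, 10_000, 0.999)    # roughly 0.5: half the budget is left
```

An alert condition can then be expressed as a comparison of `rate` against a chosen threshold over a chosen window. Both choices are policy decisions that the text above does not specify.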
BI can be incorporated into any dashboard using New Relic APIs, and popular collaboration tools can be integrated using a low-code environment.
New Relic uses AutoML to enable a view of the data and provide recommendations for correlations for noise reduction, with further improvement indicated on the roadmap. Cybersecurity currently includes vulnerability assessment and compliance management. The roadmap includes interactive application security testing for real-time insights. New Relic has started to add explainable AI to the platform. A new generative AI assistant, New Relic Grok, will provide insight as the product matures. AI bias management is not currently offered.
New Relic offers a free pricing tier with full-platform access and 100 GB/month of free data. Additional plans are based on data storage and/or data consumption. The licensing is moderately complex but allows good predictability of costs and control over data storage. Default data storage life is 30 days, with methods for expanding to 90 days. New Relic does not require the displacement of existing tools, but it is not a domain-agnostic AIOps solution. Both cost and value for New Relic are above average.
Updates in the last year have improved New Relic’s ease of use. The user interface is consistent across modules, and navigation is better than average for an AIOps platform.
New Relic is a SaaS-only platform and scales with the needs of the enterprise. Likewise, manageability is good, as is the multizone, multiregion data replication. The high-availability architecture minimizes interruptions.
Strengths: New Relic has good data ingestion capabilities. The built-in automated remediation engine plus the ability to use other orchestration tools via API integration are strong points. On May 2, 2023, New Relic Grok, a generative AI assistant for observability, was announced and will be available later this year. The solution scales easily, and the scaling is managed by New Relic. As an all-in-one solution, New Relic may be a perfect match for large enterprises willing to spend the political capital to move to a single vendor.
Challenges: Tool displacement is not required for the entire New Relic experience; however, the all-in-one nature of the solution lends itself to the replacement of redundant tooling. Enterprises looking to have a single vendor will find New Relic a good fit but may face internal challenges with replacing long-standing tooling.
OpenText (formerly Micro Focus) Operations Bridge
Founded in 1991, OpenText recently acquired Micro Focus, which was founded in 1976. Before Micro Focus was itself acquired, it had expanded its offerings through a series of acquisitions, including Attachmate, HPE Software, and Serena Software.
OpenText’s Operations Bridge (OpsBridge) targets MSPs and large enterprises. It is available either as a SaaS offering or self-managed for on-premises deployment. The product is available in the AWS marketplace and can be installed on Azure.
OpsBridge has tools and capabilities to ingest logs, metrics, events, topology, and traces, as well as custom data through its Open Data Ingestion (ODI) tooling and Prometheus. Its roadmap includes support for OpenTelemetry and collectors.
OpenText provides its own auto-remediation through Operations Orchestration. A graphical flow builder can be used to modify OOTB workflows or create new automations. External orchestration tools are supported via the API.
OpsBridge enables users to modify included dashboards or create custom dashboards, charts, and graphs through a low-code interface.
The solution can generate a prediction line for one or more metrics based on past behavior, and identify seasonal trends out to two weeks into the future. Metric data with seasonality information can be used to predict activity further into the future.
BI integration is handled via the API or by using OpenText’s OPTIC data lake. BI tools can export information to the data lake for inclusion in dashboards and reports.
OpenText provides good multicloud (public and private) support with service mapping across multiple public clouds and the ability to show all data in a single interface. It has no direct support for managing SLAs, but KPIs can be defined and displayed. However, while OpenText provides an ROI dashboard focused on event reduction, it provides no KPI management.
There is a drag-and-drop workflow designer, and integration with other workflow management tools is accomplished via out-of-the-box connections with common task and workflow tools or via OpenText’s API.
Emerging technologies for OpenText include the beginnings of AutoML with a combination of unsupervised and semi-supervised learning. No specific cybersecurity features exist; however, OpenText can consume events from security tools via API integrations. Explainable AI is supported via the company’s AI-driven automatic event correlation (AEC), which enables users to see how the AI made correlations. AI bias management is not supported by OpenText at this time; however, it is aware of how bias can be introduced into monitoring systems.
OpenText licenses according to the number of units consumed by the monitored resources—agent, agentless, synthetic, and real-user employed (for example, 5,000 agentless monitored servers would be 5,000 units, and one real user monitor probe is 20 units). Customers can switch the use of the units at any time. Professional services are not a requirement for deployment success. Tool displacement is not mandatory, though OpenText has offerings to replace the majority of monitoring and observability tooling.
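The unit arithmetic in the example above can be sketched as follows. Only the two rates quoted in the text (one unit per agentless monitored server, 20 units per real user monitor probe) are filled in. Rates for agent-based and synthetic monitoring are not stated, so the sketch refuses to guess them. All names are hypothetical.

```python
# Units per monitored resource. Only these two rates appear in the text above;
# rates for agent-based and synthetic monitoring are not stated, so callers
# must supply them explicitly.
KNOWN_UNIT_RATES = {
    "agentless_server": 1,
    "real_user_monitor_probe": 20,
}

def total_units(counts, unit_rates=KNOWN_UNIT_RATES):
    """Sum license units for a mapping of resource type -> quantity."""
    total = 0
    for resource_type, quantity in counts.items():
        if resource_type not in unit_rates:
            raise KeyError(f"no unit rate supplied for {resource_type!r}")
        total += quantity * unit_rates[resource_type]
    return total

# 5,000 agentless servers plus one real user monitor probe.
example = total_units({"agentless_server": 5000, "real_user_monitor_probe": 1})
```

Because customers can switch the use of units at any time, the same pool can be re-evaluated with a different `counts` mapping to see whether a planned mix still fits.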
Ease of use is average, with low-code/no-code capabilities for some use cases. The use of domain-specific “management packs” implements best practices by setting parameters needed for that domain. A domain in this case might be SAP, MS SQL Server, or Apache Web Server.
Scaling is easy for both the SaaS and self-managed versions of OpenText due to the containerized deployment mechanisms. Raw data is retained for 30 days, hourly aggregation is retained for 365 days, and the daily aggregation is retained for 1,825 days. One terabyte of storage is included, and customers can purchase additional storage if needed.
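The retention tiers above determine how fine-grained a historical query can be. The sketch below looks up the finest granularity still retained for data of a given age. It treats each limit as inclusive, which is an assumption, and the names are illustrative.

```python
# Retention windows in days, finest granularity first (figures from the text above).
RETENTION_TIERS = [("raw", 30), ("hourly", 365), ("daily", 1825)]

def finest_granularity(age_days):
    """Return the finest tier still retained for data of the given age, or None."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for name, max_age_days in RETENTION_TIERS:
        # Assumption: a 30-day limit still covers data that is exactly 30 days old.
        if age_days <= max_age_days:
            return name
    return None
```

For example, `finest_granularity(45)` falls past the raw window and returns the hourly tier, while data older than 1,825 days returns `None`.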
High availability and disaster recovery are standard in the SaaS environment. For self-managed implementations, an active-active or active-passive environment can be created. Upgrades are performed using a script or program for each release. The centralized agent manager deploys updates to agents via policies.
Strengths: OpenText has strong auto remediation with good predictive ability. It supports multiple clouds, both public and private. Cost and value are good, as are scaling and managing the solution.
Challenges: The solution should include value metrics and a single view for hybrid implementations. More thought is needed regarding cybersecurity (the company currently views security as a silo issue). Though the solution can be domain-agnostic, a full set of OpenText tools provides greater functionality.
OpsRamp AIOps
Founded in 2014 in San Jose, California, OpsRamp is a technology company that provides a cloud-based platform for IT operations management, using advanced analytics and ML. OpsRamp has expanded its offerings to include AIOps. OpsRamp is now a Hewlett Packard Enterprise company.
The OpsRamp AIOps platform leverages AI and ML to help IT teams proactively monitor and manage their digital infrastructure. A notable aspect of OpsRamp AIOps is its ability to automate event management using ML to identify patterns in IT incidents and resolve them before they impact the end-user experience. Moreover, its integration with other IT management tools allows IT teams to consolidate their monitoring and management activities into a single platform. OpsRamp AIOps also provides intelligent insights and analytics, allowing IT teams to optimize their infrastructure and prevent issues from occurring in the first place.
The OpsRamp platform is a scalable, SaaS-based AIOps application for on-premises and cloud infrastructures that enables resource monitoring and management across business units and locations. Supported public cloud regions include North America, Europe, and Asia. There is no regional support in South America or Africa. The target market includes MSPs, SMBs, and large enterprises. OpsRamp supports multitenancy.
OpsRamp ingests data through agents and gateways. Agents are deployed executables that run on managed resources within on-premises and cloud infrastructures. Gateways are virtual appliances that provide secure communications, non-server resource monitoring, and limited data storage for information in transit.
OpsRamp uses automation to act on resource faults, remediating issues in response to events or performing routine maintenance tasks. Automation can be engaged for distinct tasks or a sequence of tasks.
Alert prediction discovers seasonality patterns using ML algorithms and serves as an early-warning system to take preventive action in advance of future alerts. Time frames of up to 90 days are possible, assuming sufficient historical data.
BI integration is possible. The API interface, though not low code, is relatively easy to use. Out-of-the-box integrations are numerous.
The platform supports SLA policy definition for response timeframes for assignees, entity resolution based on priorities, automatic escalation rules for SLA violations and user notifications, and incident and service requests. This is an above average capability.
Task and workflow management are handled using Service Desk. Task execution is automated in Service Desk and runs online tasks on a specific date and time. Automation workflow capabilities include triggers, workflow, and actions.
Value metrics can be created on dashboards. The process is not low-code but is mostly painless. The AIOps event management engine is used to drive incident resolution for value metrics. OpsRamp is a complete platform for infrastructure monitoring, so tool displacement is recommended. An enterprise can use OpsRamp as part of a tool consolidation strategy, thus replacing existing tools. It is not a requirement, however, that all tools be replaced. OpsRamp can work in organizations with multiple redundant tools and integrate them into OpsRamp to consolidate tools. Professional services are available.
The emerging technologies of AutoML, cybersecurity, explainable AI, and AI bias management discussed in this report are not available from OpsRamp.
Strengths: OpsRamp is a unified platform for infrastructure monitoring and event management. Data collection uses both agent and agentless methods, with gateway collectors on-premises. The remediation engine is robust, and there is support for SLA management and multiple clouds. Scalability and manageability are typical for SaaS solutions.
Challenges: The vendor doesn’t offer any of the emerging technologies identified in this report (Table 4), which points to possible weakness in the AI/ML direction for OpsRamp.
PagerDuty AIOps
Founded in 2009 in San Francisco, California, PagerDuty provides a cloud-based platform for digital operations management, using advanced ML, automation, and analytics. The PagerDuty Operations Cloud recently expanded its solution offerings with the launch of a dedicated product, PagerDuty AIOps, in April 2023.
PagerDuty AIOps is a domain-agnostic SaaS-only solution built on the Terraform open-source infrastructure. Its target market includes cloud and network service providers, SMBs, and large enterprises. PagerDuty integrates with more than 750 technology partners and works with any technology stack.
The out-of-the-box data ingestion options are extensive and may verge on “any data from anywhere” capabilities. Data from BI tools, such as Power BI, Snowflake, Looker, and Tableau, can be ingested via the PagerDuty API.
PagerDuty AIOps provides event-driven automation that can be implemented per service or span across services. PagerDuty Visibility Console allows users to add, remove, and resize modules on any dashboard. It filters by team or service and prioritizes various modules, including the Incidents and the service activity modules. The event orchestration feature drives to “next best action,” including diagnostics, incident actions, and remediations, by triggering user-defined webhooks, incident workflows, or on-platform automation actions.
As a real-time solution, PagerDuty does not feature predictive analytics.
PagerDuty acquired Catalytic in 2022 and launched Incident Workflows with a low-code/no-code engine that is available OOTB to assist with human-in-the-middle automation flows. This is a strong feature for PagerDuty.
The solution includes an analytics suite featuring reports, dashboards, and on-call score cards to provide general statistics and health score capabilities out of the box. Metrics include incident count, MTTA, MTTR, response effort, interruptions, notifications, escalations, and reassignments. Each metric can be further broken down by a number of dimensions, including time, incident, technical service, responder, team, and business service.
The cost and value evaluation registers a number of strong points for this solution. PagerDuty’s AIOps solution can be purchased by seat with the Digital Operations plan, which includes OOTB event management AI/ML features, or by consumption of events with the PagerDuty AIOps SKU, which includes cross-service event management capabilities. PagerDuty is a domain-agnostic solution, so tool displacement is not necessary. Professional services are available but not mandatory for deployment.
Ease of use is above average with a low-code/no-code environment for dashboards and a quick learning curve. Dashboards are customizable at any level. Automated incident workflows are easily configured via a drag-and-drop UI. Scalability is not a factor due to the SaaS implementation.
PagerDuty addresses AutoML to an extent, but user interaction is still necessary in some use cases. There is currently no direct support for cybersecurity technologies or AI bias management. PagerDuty takes a holistic approach to explainable AI, allowing users to understand AI models before enabling the feature, during triage, and over time as the system and organizational needs change.
Strengths: PagerDuty offers extensive data ingestion capabilities, and its automated remediation is another strong point. Dashboards are customizable and shareable. Task and workflow management are clear strengths. The cost and value proposition is good, with no requirement for professional services. As a domain-agnostic solution, PagerDuty does not replace existing tools. It is easy to use, scales well, and can be easily managed.
Challenges: There are no predictive capabilities and no service-level management. PagerDuty offers the beginnings of the emerging technologies in this Radar, with further support expected.
ScienceLogic SL1 AIOps
ScienceLogic was founded in 2003 in Reston, Virginia. Its platform was developed to help organizations gain real-time visibility into their IT infrastructure, applications, and networks, and the company has expanded its offerings to include AIOps based on its recent acquisition of Zebrium.
ScienceLogic SL1 AIOps can be deployed on customer premises and in customer-managed public clouds, and it is also available via a ScienceLogic-managed SaaS model. SL1 can support any public or private cloud infrastructure, including hybrid environments. Target customers include SMBs, MSPs, and large enterprises.
SL1 brings all data together with operational support for traditional data centers, cloud-native services (Docker, Kubernetes, microservices, and serverless), and hyperscalers (AWS, Azure, Google, and VMware). OpenTelemetry is supported. Data collection includes the initial and continuing discovery of data from various sources, including agents, devices, applications, and services, and the collection process itself should match the type of asset being monitored.
SL1 allows you to specify actions you want it to execute automatically when specific event conditions are met. Automation in SL1 is divided into an automation policy that defines the event condition that triggers an automatic action, and an action policy that can perform tasks.
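The split between a triggering condition and the task it runs can be pictured with the sketch below. The class names echo the terms used above, but the fields, the event dictionary, and `handle_event` are hypothetical and are not the SL1 API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ActionPolicy:
    """A task to perform, such as running a script or opening a ticket."""
    name: str
    run: Callable[[Dict], str]

@dataclass
class AutomationPolicy:
    """An event condition plus the action policies it triggers."""
    name: str
    condition: Callable[[Dict], bool]
    actions: List[ActionPolicy]

def handle_event(event, automation_policies):
    """Run every action whose automation policy matches the event; return the results."""
    results = []
    for policy in automation_policies:
        if policy.condition(event):
            results.extend(action.run(event) for action in policy.actions)
    return results

# Hypothetical example: request a cleanup when a critical disk event arrives.
request_cleanup = ActionPolicy(
    name="request-disk-cleanup",
    run=lambda event: f"cleanup requested on {event['device']}",
)
critical_disk = AutomationPolicy(
    name="critical-disk-usage",
    condition=lambda event: event["type"] == "disk" and event["severity"] == "critical",
    actions=[request_cleanup],
)

outcome = handle_event(
    {"type": "disk", "severity": "critical", "device": "db-01"},
    [critical_disk],
)
```

Keeping the two policy types separate lets one action be reused by several conditions.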
ScienceLogic provides template dashboards that can be customized using a low-code environment. Dashboards can be saved and shared, and added to a favorites list for quick access. Drill-down can be time- or device-based, with filters to limit the data displayed to what is required.
Predictive capabilities in SL1 can forecast data for a specific device or for a collection of metrics using historical data and selected regression methods. Forecasts can use hourly data or daily rollup data. Regression algorithms include exponential, linear, logarithmic, seasonal drift, and seasonal weighted methods. Multiple regression methods can be selected, and SL1 will determine which regression method(s) will produce the best forecast.
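One way to picture choosing among several regression methods is to fit each on older data and compare errors on the most recent points. The sketch below does this for the linear, exponential, and logarithmic cases only. The seasonal methods are omitted, and the holdout comparison is an assumption, since the text does not say how SL1 ranks methods.

```python
import math

def _least_squares(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_linear(xs, ys):
    a, b = _least_squares(xs, ys)
    return lambda x: a + b * x

def fit_exponential(xs, ys):
    # y = A * exp(b*x), fitted on log(y); requires every y > 0.
    a, b = _least_squares(xs, [math.log(y) for y in ys])
    return lambda x: math.exp(a + b * x)

def fit_logarithmic(xs, ys):
    # y = a + b*ln(x); requires every x > 0.
    a, b = _least_squares([math.log(x) for x in xs], ys)
    return lambda x: a + b * math.log(x)

METHODS = {
    "linear": fit_linear,
    "exponential": fit_exponential,
    "logarithmic": fit_logarithmic,
}

def best_method(xs, ys, holdout=3):
    """Fit each method on the early points; rank by mean absolute error on the last `holdout` points."""
    train_x, train_y = xs[:-holdout], ys[:-holdout]
    test_x, test_y = xs[-holdout:], ys[-holdout:]
    errors = {}
    for name, fit in METHODS.items():
        model = fit(train_x, train_y)
        errors[name] = sum(abs(model(x) - y) for x, y in zip(test_x, test_y)) / holdout
    return min(errors, key=errors.get), errors

# Hypothetical daily rollup values that grow linearly.
days = list(range(1, 11))
usage = [2 * d + 5 for d in days]
chosen, errors = best_method(days, usage)
```

With the linear sample data, `chosen` is `"linear"`. Swapping in exponentially growing values would favor the exponential fit instead.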
BI integration is possible with SL1 PowerFlow, a generic platform that enables integration between SL1 and third-party applications, such as ServiceNow, Restorepoint, xMatters, Opsgenie, or Cherwell Service Management. The PowerFlow platform sits between SL1 and the third-party application and handles the flow of data.
Service-level management allows enterprises to create an SLA definition whose threshold is applied to the availability key metric of an IT service policy. Management is performed via the SLA definition editor.
Task and workflow management in SL1 are handled by the PowerFlow builder, which enables users to create low-code workflows. Workflows may be integrated with third-party systems and can include steps with human interactions.
Value metrics can be defined and added to dashboards. Risk of failure is an additional value metric from ScienceLogic.
ML-based anomaly detection and some other features, such as assisted workflow automation, are available only in the ScienceLogic SL1 All-in-one Platform at the SL1 premium level. Thus, to get AI/ML functionality, organizations will need to replace their current tools (tool displacement) or purchase redundant tooling. Data retention is for two years, and additional low-code features are available only with the premium offering. Licensing is per node, per month, with volume pricing available.
Ease of use, supported by consistent low-code dashboards, and workflow management are above average. The platform is easy to scale in the SaaS offering. For on-premises scaling, documentation and professional services are available for suggestions and support. The solution is extensive, and at the premium level, several tools can help to enhance manageability.
The emerging technologies discussed in this report are not available from ScienceLogic, though with the acquisition of Zebrium, this may change in the future.
Strengths: ScienceLogic offers an extensive platform with standout features in a number of areas. The potential for the vendor to be a true leader in AIOps derives from its all-in-one mindset, and enterprises with a similar perspective that are willing to spend the political capital to replace or displace existing tools will benefit from its offerings.
Challenges: AI features are lagging; however, with the Zebrium platform moving toward full integration with SL1, these problems may resolve. Also, SL1 is a complete platform, and some enterprises will not want to replace tools they’re accustomed to.
ServiceNow
ServiceNow was founded in 2004 in San Diego, California. The company provides a cloud-based platform for enterprise service management, including ITSM, HR service delivery, customer service management, and AIOps.
ServiceNow targets SMBs and large enterprises with a SaaS offering hosted in a private enterprise cloud that’s fully owned and operated by ServiceNow. The cloud features a multiple-instance architecture that delivers logical single tenancy by isolating each customer’s data from all the others. This is achieved by using an enterprise-grade cloud architecture and dedicated database and application services per customer instance, ensuring that there is no possibility of co-mingling customer data, unlike in a multitenant architecture with a shared database. Self-hosting is available on an exception basis only. There are two platform updates yearly, and ServiceNow adds features and capabilities frequently. Regular AIOps capability updates are made through the ServiceNow Store throughout the year.
ServiceNow is known for its extensive dashboards, which allow visualization of current data and prediction trends. The process of creating dashboards has improved but remains challenging until users have gained sufficient experience. ServiceNow has a Service Operations Workspace Express List dashboard, making it easy for IT operations teams to address alerts from within a single dashboard.
ServiceNow provides connectors to third-party products, such as Dynatrace and New Relic, to import application data. Service Graph Connectors, which are built by ServiceNow and its technology partners, import third-party data into ServiceNow Service Graph. This is the evolution of the ServiceNow CMDB, extending coverage to planning, application development, deployment, performance, cost, and business processes, as well as other areas, by implementing the ServiceNow common service data model (CSDM).
Service-level management is a part of ServiceNow. Task and workflow management use a low-code environment and are known for their ease of use.
A full AIOps implementation from ServiceNow can displace all existing tools, and the ServiceNow platform provides compelling reasons to consider making ServiceNow the single vendor for your company. Still, the adoption of the entire ServiceNow platform along with the AIOps solution may cost so much in both time and money that it could require extensive political capital to deploy in a large enterprise.
Strengths: ServiceNow offers a strong AIOps platform that enables powerful workflows. Additional modules provide risk management, cybersecurity, IT operations management, and more. The platform has outstanding capabilities in automation, learning systems, dashboards and reports, and systems integration, and it is investigating and expanding into all of the emerging technologies in this report.
Challenges: The UI for ServiceNow can be challenging, though this is mitigated to some extent by an excellent training and certification system. To fully use the ServiceNow AIOps with the extensive ServiceNow portfolio, full tool displacement is required, which may involve extensive political capital investment, careful planning, and professional services to complete; however, the ServiceNow AIOps solution can be used independently of other ServiceNow tools. Upgrades and feature additions may lead to increased costs down the road.
Splunk AIOps
Splunk, founded in 2003 in San Francisco, California, has a history of strategic acquisitions. In 2015, Splunk acquired Metafor, adding ML and anomaly detection capabilities to its platform. This was followed by Phantom in 2017, for security orchestration and automation; VictorOps in 2018, for collaborative incident response; and SignalFx in 2019, for cloud monitoring and observability. Splunk has also made smaller acquisitions, such as KryptonCloud, Omnition, and Plumbr, to add specialized capabilities.
The target market for Splunk AIOps is service providers (cloud, managed, and network), as well as SMBs and large enterprises. Deployment of Splunk Enterprise can be self-managed on physical and virtual appliances. Splunk does not currently support deployment of Splunk Observability solutions on public cloud images; however, Splunk Core Platform can be deployed as an image or by using a Kubernetes operator on OpenShift or any CNCF-certified Kubernetes platform.
OpenTelemetry is the default instrumentation method used for data collected in Splunk Observability Cloud. Splunk can consume data from almost any source, including time series data generated by popular BI software like Microsoft Power BI and Tableau.
Splunk natively provides integration to automation tools like Splunk security orchestration, automation and response (SOAR) for orchestrated workflows, auto remediation and automated actions. Splunk can also integrate with external automation tools such as Ansible, Puppet, and UiPath via an API. It provides out-of-the-box customizable dashboards for operators, practitioners, and stakeholders.
Predictive analytics within Splunk uses ML algorithms to forecast the health score value of a selected service in Splunk IT Service Intelligence (ITSI). The AI models use historical service health scores and KPI data to approximate what a service’s health might look like in 30 minutes.
Splunk is working on out-of-the-box guided workflows to set up, manage, and troubleshoot breached SLOs. Service-level management is not a part of the Splunk solution; however, KPI management at the service level is available.
Splunk Incident Intelligence, a feature of the Splunk Observability Cloud, provides on-call management (scheduling, routing, escalation, and notification) to diagnose and remediate incidents. It also provides integrated ChatOps to enable teams to collaborate more efficiently.
Splunk IT Service Intelligence (ITSI) is an analytics and IT management solution that allows organizations to predict incidents. ITSI is a premium product for Splunk Enterprise or Cloud. ITSI provides real-time and predictive performance dashboards. Its episode view feature displays a dashboard that dynamically provides value metrics around mean time to acknowledge (MTTA) and mean time to resolve (MTTR), as well as the percentage of noise reduction. Splunk ITSI works as a manager of managers, providing visibility across existing monitoring and observability tools without the need to replace them.
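The value metrics named above (MTTA, MTTR, and percentage of noise reduction) can be illustrated with a small calculation. The following is a minimal sketch in Python, not Splunk's implementation: the incident timestamps, the event counts, and the function names mean_minutes and noise_reduction_pct are hypothetical, and noise reduction is assumed here to mean the share of raw events collapsed by grouping them into episodes.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(pairs):
    """Average gap in minutes between (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


def noise_reduction_pct(raw_events, episodes):
    """Percentage of raw events absorbed by grouping them into episodes."""
    if raw_events == 0:
        return 0.0
    return 100.0 * (raw_events - episodes) / raw_events


# Hypothetical incidents: (opened, acknowledged, resolved)
t0 = datetime(2023, 1, 1, 9, 0)
incidents = [
    (t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=45)),
    (t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=75)),
]

# MTTA: opened -> acknowledged; MTTR: opened -> resolved
mtta = mean_minutes([(opened, acked) for opened, acked, _ in incidents])
mttr = mean_minutes([(opened, resolved) for opened, _, resolved in incidents])
noise = noise_reduction_pct(raw_events=1000, episodes=50)

print(f"MTTA={mtta:.1f} min, MTTR={mttr:.1f} min, noise reduction={noise:.1f}%")
```

In this sketch, both means are measured from the time the incident was opened; some organizations measure MTTR from acknowledgment instead, so the definition should be agreed on before comparing numbers across tools.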
Examining the evaluation metrics for Splunk AIOps shows a cost value that is average for AIOps solutions. Licensing is based on the hosts for Splunk Observability Cloud, the volume of data ingested, and the workload. The primary factors for workload pricing are the compute capacity or resources consumed for different search and analytic workloads, and the control to optimize workloads. Professional services are not required to deploy a successful solution.
Splunk’s ease of use is average, though training is highly recommended for administrators. A no-code/low-code environment is not available for data ingestion, but there is extensive support for common data streams.
Scaling the SaaS portions of the Splunk AIOps solution is painless, as is BC/DR. Splunk has a documented disaster recovery plan, which is reviewed and approved annually. For the on-premises portion of the AIOps solution, enterprises can get started with simple installer scripts or Helm charts and scale their lightweight observability agents through configurations in third-party automation tools such as Ansible, Puppet, and Chef. The OpenTelemetry collector acts as an agent. Customers manage the full lifecycle of Splunk agents deployed in their environment.
Splunk does not currently have complete AutoML capabilities. Automated adjustments to models can be made using partial_fit, which updates existing models with new data rather than retraining them from scratch. For other use cases such as anomaly detection, the distribution (dist) is automatically detected using the “auto” setting, where the best parameter is chosen based on the ideal data.
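The idea behind partial_fit, updating a model with new data instead of rebuilding it, can be shown outside Splunk. The sketch below is an illustration only and is neither SPL nor Splunk's algorithm: it assumes a simple Gaussian model of a metric, maintained incrementally with Welford's algorithm, and the class name RunningGaussian, the three-standard-deviation threshold, and the sample values are all hypothetical.

```python
import math


class RunningGaussian:
    """Mean and variance maintained incrementally (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, batch):
        """Fold a new batch into the model without revisiting old data."""
        for x in batch:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self._m2 += delta * (x - self.mean)
        return self

    @property
    def std(self):
        """Sample standard deviation; 0.0 until two points have been seen."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomaly(self, x, threshold=3.0):
        """Flag values more than `threshold` standard deviations from the mean."""
        if self.std == 0.0:
            return False
        return abs(x - self.mean) / self.std > threshold


model = RunningGaussian()
model.partial_fit([10, 12, 11, 9, 10])   # initial data
model.partial_fit([11, 10, 12, 9, 11])   # later batch updates the same model

print(model.n, round(model.mean, 2), model.is_anomaly(25), model.is_anomaly(11))
```

The design point is that each batch is processed once and then discarded; only the count, mean, and squared-deviation sum are kept, which is what makes incremental updates cheap compared with full retraining.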
Splunk’s security solutions include the Splunk big data analytics platform, the Enterprise Security SIEM application, the behavioral analytics cloud service, the threat intelligence management cloud service, and Splunk SOAR for orchestration and response.
The Splunk ML Toolkit is not an explainable AI solution but a method to build new AI models using over 30 different algorithms. The ML toolkit also contains “Smart Assistants” that provide a guided experience to users so they can operationalize ML models without requiring any ML or SPL domain expertise.
Strengths: Splunk can ingest data from almost any source and includes OpenTelemetry. It supports multiple clouds, public and private. Scalability and manageability for the SaaS portions of the Splunk solution are good.
Challenges: Service-level management needs improvement. AutoML is not supported, but the use of Smart Assistants allows exploration of AI models. The complicated hybrid solution, with parts of the solution using SaaS while others are self-managed, can be difficult for some enterprises and thus may require professional services.
Sumo Logic AIOps
Sumo Logic, founded in 2010 in Redwood City, California, has a global presence, with offices and customers in multiple countries around the world.
Sumo Logic is a cloud-native, multitenant SaaS solution hosted on AWS and available in multiple regions. It supports MSPs, SMBs, and large enterprises.
Data collection and analysis are robust and include OpenTelemetry. The Sumo Logic OTel Collector supports a wide range of data sources. Out-of-the-box integrations are plentiful, and the support for OpenTelemetry allows easy collection of metrics, events, logs, and traces from any source.
Sumo Logic allows events to suggest (or trigger) an automated response (human-in-the-loop or not). This is better than average for AIOps tools and a strong point for Sumo Logic. The native automation engine and methods can be enhanced with the use of external automation and orchestration tools using APIs.
Sumo Logic dashboards allow you to analyze metric and log data. A dashboard view is used to build, maintain, and interact with dashboards and build data panels inside the dashboard. Customization and sharing are on par with other AIOps solutions.
Prediction capabilities in Sumo Logic are limited to a single metric at a time via a low-code interface. Prediction supports linear and auto-regressive models, and forecasts are limited to 50 data points. Though Sumo Logic meets the bar for having predictive capabilities, these constraints may limit the usefulness of the forecasting.
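To make the single-metric, capped-horizon constraint concrete, the following is a minimal sketch of a linear forecast in Python. It is not Sumo Logic's implementation or query syntax: it assumes an ordinary least-squares line fitted to evenly spaced samples, and the function name linear_forecast, the constant MAX_FORECAST_POINTS, and the sample series are hypothetical. Only the 50-point cap comes from the text above.

```python
MAX_FORECAST_POINTS = 50  # cap described in the report text


def linear_forecast(values, horizon):
    """Fit y = intercept + slope * x by least squares and extrapolate.

    `values` are evenly spaced samples of one metric (x = 0, 1, 2, ...).
    Returns `horizon` predicted points following the last sample.
    """
    if len(values) < 2:
        raise ValueError("need at least two samples to fit a line")
    if not 1 <= horizon <= MAX_FORECAST_POINTS:
        raise ValueError(f"horizon must be between 1 and {MAX_FORECAST_POINTS}")

    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + i) for i in range(horizon)]


# Hypothetical metric that grows by 2 units per sample
forecast = linear_forecast([2, 4, 6, 8], horizon=3)
print(forecast)
```

With a cap of 50 points, the reach of a forecast depends entirely on the sampling interval: at one-minute granularity it covers under an hour, while at one-day granularity it covers about seven weeks.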
Sumo Logic integrates with a number of third-party tools. BI integration can be achieved using APIs, though there are out-of-the-box integrations for common BI tools. This is average for most AIOps tools. The integration environment is low-code, which may make the integrations easier.
Sumo Logic can ingest data from all major cloud providers. That data can be viewed separately or in a single view. Private clouds are also included in the single view in this SaaS deployment, providing a full view of all cloud applications.
Sumo Logic provides a service level and reliability solution to manage SLOs and SLIs based on the OpenSLO standard. This allows integration of SLO with DevOps processes and is a better than average implementation of service-level management. OpenSLO supports GitHub and may require additional technical skills for the initial implementation.
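The relationship between an SLI, an SLO target, and the resulting error budget can be sketched in a few lines. The example below is plain Python and does not use the OpenSLO schema or any Sumo Logic API: the function names, the 99.9% target, and the request counts are hypothetical and chosen only to show the arithmetic.

```python
def sli(good, total):
    """Service-level indicator: fraction of events that were good."""
    return good / total if total else 1.0


def error_budget_remaining(good, total, target):
    """Fraction of the error budget still unspent (negative if overspent).

    `target` is the SLO as a fraction, e.g. 0.999 for 99.9%.
    """
    if total == 0:
        return 1.0
    allowed_bad = (1 - target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


# Hypothetical 30-day window: 100,000 requests, 50 of them failed
good, total, target = 99_950, 100_000, 0.999
print(f"SLI={sli(good, total):.4%}, "
      f"budget remaining={error_budget_remaining(good, total, target):.0%}")
```

Here a 99.9% target over 100,000 requests allows about 100 failures, so 50 failures leave roughly half the budget; tying release decisions to that remaining fraction is the kind of DevOps integration the paragraph above refers to.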
The workflow definition environment includes both low-code and no-code features via the Sumo Logic cloud SOAR automation module.
Value metrics from Sumo Logic concentrate on the DevOps research and assessment (DORA) metrics, with built-in dashboards to display DORA data. Other value metrics can be defined.
Sumo Logic applies autoML to security use cases for hyperparameter tuning, model drift detection, and remediation. The company’s roadmap includes generalizing these capabilities for other use cases.
Sumo Logic is also a security platform for security engineers, threat responders, and threat hunters, offering cloud security analytics for threat analytics and investigations, cloud SIEM for high fidelity threat detection, and cloud SOAR for automated incident response.
Explainable AI is used internally but is not exposed to customers. Future releases may have more explainable AI features.
AI bias mitigation strategies are used internally by Sumo Logic. A future release may expose bias mitigation to customers.
In terms of cost and value, Sumo Logic provides a flexible value-based licensed model based on Cloud Flex Credits (CFC). Each service agreement is based on a number of credits for a specific term. These credits can be used for various product features, depending on customer needs.
There are five subscription packages available, ranging from Sumo Free to Enterprise Suite, which leverage this model. Credit-eligible components are based on the amount of data ingested by type of telemetry (logs, metrics, traces) and by tier of storage (continuous, frequent, and infrequent access).
Professional services are not required, and Sumo Logic Free users require no assistance. Tool displacement is not required. Sumo Logic is an OpenTelemetry platform, and ingestion of non-OTel sources is not difficult.
Ease of use is average for an AIOps solution. The combination of minimal coding, low-code, and no-code capabilities simplifies using Sumo Logic.
As a SaaS platform, scaling is simple and handled by Sumo Logic. Trace data retention is limited to 15 days. The default data retention for raw logs is 30 days and is configurable up to 5,000 days with additional cost. Cloud SIEM enterprise (CSE) data has various retention periods; CSE records are stored for 90 days, and there is no additional charge for storage; CSE signals are stored for two years. CSE insight data has a default retention of 30 days, and there is a cost for expansion.
The platform is a highly available SaaS offering managed by Sumo Logic; however, organizations update their deployed agents using their own automation (Terraform, Ansible, and so forth).
Strengths: Sumo Logic supports data collection using OpenTelemetry, and multiple clouds, public and private, are supported. There is service-level management, a task and workflow module, and good value metrics. As a SaaS solution, scaling and management are good. Other strong areas are alerting, alert grouping and alert response, root cause exploration, and global intelligence benchmarking.
Challenges: Explainable AI and AI bias management are limited. Prediction and forecast metrics are constrained to 50 future data points.
Zenoss
Zenoss Inc., which specializes in IT monitoring and AIOps, was founded in November 2005 in Austin, Texas. Zenoss combines original programming and several open source projects to integrate data storage and data collection processes with a web-based user interface.
Zenoss delivers full-stack monitoring along with AIOps, collecting all types of machine data, including metrics, dependency data, events, streaming data, and logs, to build real-time IT service models that train ML algorithms to deliver AIOps analytics capabilities. The target markets are large enterprises, SMBs, and MSPs.
The AIOps solution has a SaaS-only deployment that delivers Zenoss using Amazon Web Services. For customers who require an on-premises deployment, the Zenoss on-premises platform (Zenoss Service Dynamics) offers configurability and customization but lacks some of the features of the cloud offering. There’s no consolidated dashboard for enterprises wishing to use both.
Via its collector framework, which can be installed in any public cloud region, Zenoss supports both agent-based and agentless collection of metric, model, and event data. Collectors use generic protocols (ICMP, SNMP) or customized protocols (API calls) to gather data. The framework is a small, pre-configured Kubernetes deployment that runs under MicroK8s on a host in customer environments. Zenoss Collector communicates with Zenoss Cloud using a standard HTTPS connection. Agentless collection is supported with ZenPack Adapter, which enables the use of Zenoss ZenPacks without a collection zone.
Data from BI tools can be integrated with Zenoss APIs or ZenPacks. Multiple public clouds are supported along with private clouds. Service-level management is not a part of Zenoss; however, thresholds can be created for a device class, individual device, or individual component. Zenoss integrates with common task and workflow management tools. Value metrics can be defined and inserted into custom dashboards.
Zenoss does not have automated remediation functionality, but it does integrate with provisioning, orchestration, and automation solutions. All dashboards are created by the enterprise. Zenoss provides Smart View pages to display graphs and charts of every entity in the environment, including metrics and events during the time range selected, whether in minutes or months. Smart View uses ML to find unusual patterns and changes in resource consumption from event and performance streams; explore historical data in context with living, dynamic dependency modeling; predict service impacts from any failing component and related resources; and explore a dependency graph of related entities.
Licensing is provided at two levels: professional and enterprise. The professional level is for smaller, simpler environments with limited data needs and only basic support. The enterprise level is for larger, dynamic environments with large-scale data requirements and premium support. The default data retention is three months for the professional level and 15 months for the enterprise level. Additional data storage can be purchased. The number of data points per day is 40,000 for the professional level and 120,000 for the enterprise level. Increases in 20,000 increments are available. Professional services are available. Tool displacement is not required; however, Zenoss data collectors must be deployed, which can cause redundant data generation.
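The data point allowances above lend themselves to a quick sizing check. The sketch below is a hypothetical helper, not a Zenoss tool or pricing calculator: it uses only the base allowances and increment size stated in this report, and the function name increments_needed and the example workload of 95,000 data points per day are assumptions for illustration.

```python
import math

# Figures taken from the report text; no prices are assumed
BASE_DATA_POINTS_PER_DAY = {"professional": 40_000, "enterprise": 120_000}
INCREMENT = 20_000


def increments_needed(tier, required_per_day):
    """Number of 20,000-point increments needed on top of a tier's base."""
    shortfall = required_per_day - BASE_DATA_POINTS_PER_DAY[tier]
    return max(0, math.ceil(shortfall / INCREMENT))


# Hypothetical workload: 95,000 data points per day
for tier in BASE_DATA_POINTS_PER_DAY:
    extra = increments_needed(tier, 95_000)
    capacity = BASE_DATA_POINTS_PER_DAY[tier] + extra * INCREMENT
    print(f"{tier}: {extra} increment(s), capacity {capacity:,} per day")
```

For this workload the professional level would need three increments (reaching 100,000 per day), while the enterprise level covers it without any; whether the increments or the higher tier is the better buy depends on pricing and retention needs, which this sketch does not model.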
Ease of use is variable. Users can create dashboards using the product’s low-code capabilities, but substantial knowledge of the environment is necessary. The smart views are easy to navigate but are less customizable. The SaaS solution scales easily and is easy to manage, but the on-premises version requires careful consideration before scaling and a proof of concept or assurances for BC/DR management.
The emerging technologies discussed in this report are not available from Zenoss.
Strengths: Zenoss provides good data ingestion and monitoring with multicloud support. Tool displacement is not required.
Challenges: There is limited prediction capability. Service-level management is not provided. Dashboards and smart views are not as easy to use as they should be and make visualization of data complicated. There is no auto-remediation support.
ZIF
ZIF, a relatively new technology company that was founded in 2021 in San Francisco, California, provides the Zero Incident Framework (ZIF), an AI-powered platform for IT operations management that uses ML algorithms to automate and optimize IT operations processes. The platform also includes features such as predictive analytics, anomaly detection, and automated incident response. Its customers may be SMBs, MSPs, or large enterprises. Deployment is via SaaS only.
ZIF is an AIOps platform that can stream real-time data from various monitoring, ITSM, and log management tools. Data ingested into the platform is aggregated, and supervised and unsupervised ML algorithms are run on the streaming data to generate analytics in real time. ZIF has bots that can auto-heal incidents, based on the predictive analytics driven by the ML algorithms.
The ZIF platform has a universal connector, a low-code component that can integrate with any third-party tools to ingest data. Agent and agentless data ingestion is supported. ZIF can ingest any format of data using the universal connector in real time. Integration with BI tools is possible using the universal connector. There is no support for OpenTelemetry.
Auto-discovery and mapping of applications to devices is done by ZIF Discovery. Business process to application mapping must be done manually by business owners.
ZIF has an integrated automation module that can resolve incidents either automatically or with manual triggers. ZIF bots can be developed using any scripting language and uploaded into ZIF. They can be reused in workflows that can be created using ZIF’s workflow creator, a low-code component. Twenty-three use cases are supported out-of-the-box, with others on the roadmap.
ZIF’s AI capabilities enable it to show all infrastructure data within a single dashboard. Multiple dashboards can be added, and with auto-rotation turned on, dashboards can be swapped to show different information. ZIF also offers predefined dashboard templates that can be customized to create new dashboards.
ZIF uses patented ML algorithms in the predictive analytics module, which employs unsupervised ML algorithms to predict a potential failure or performance degradation in devices or applications. ZIF predicts failures up to 60 minutes into the future. Resolving the predicted issues can be done manually or with automation.
ZIF Insights provides current and predictive views of business processes and services. Multicloud support is available for private and public clouds via the universal connector. Service-level management is not a feature of ZIF, but task and workflow management is available both as native workflows and integration with other tools. Value metrics can be calculated but are not a native feature of ZIF.
The cost/value proposition for ZIF is governed by its licensing models. There is a perpetual license with an estimated TCO of three to four years. The subscription model is based on a per-device cost. Professional services are available. ZIF is essentially domain agnostic. It can gather data from any source and does not require the displacement of existing tools.
The solution makes use of low-code environments whenever possible, and the interface is consistent throughout the product. As a SaaS solution, it scales easily, though data transport costs may be incurred depending on the enterprise agreements with carriers. The solution is managed by ZIF with little burden on the enterprise.
Basic autoML is present, though users still have a few manual configuration tasks. For example, the mapping of business processes to business applications must be done by hand. Other emerging technologies are not part of the current ZIF feature set.
Strengths: ZIF provides good data ingestion capabilities. Automatic remediation is part of the solution and can interact with external tools. Custom dashboards are good, as is task and workflow management using ZIF bots. Tool displacement is not required.
Challenges: There is limited support for emerging technologies and no consumption-based pricing. There is no support for OpenTelemetry.
6. Analyst’s Take
The AIOps world is more crowded than in previous years, with more viable options than ever. Generative AI is the buzz in 2023. This report dips its toes into those waters by looking at emerging technologies such as autoML, AI bias management, explainable AI, and another area where AI may have profound impacts in the future, cybersecurity. Generative AI is expected to become a feature as soon as these capabilities become available to vendors.
The displacement of incumbent tools continues to be a concern with AIOps. The platform players perform best when their tools control monitoring, alert, and event data. For large enterprises, the political capital necessary to displace organically sourced and developed tooling is substantial. Domain-agnostic tools may be better solutions. These can be layered on top of existing tools with little or no redundancy. Even so, internal friction may slow the ingestion rate for data flows from everywhere in the environment.
Domain-specific AIOps tools handle it all: monitoring, alerts, events, automation, and often workflow management. The cost of gaining consensus may be less than the gains possible with a holistic approach. In these cases, a platform solution is ideal.
All AIOps solutions need some improvement in predictive analytics. There are vendors pushing ahead, but the next generation of AI will allow forecasting to reach new levels that will springboard automated remediation and BI, and enable more automated management of the enterprise.
The value of AIOps has always been to allow operations teams to deal with the data deluge in the digital age. With organizational awareness as a focus, AIOps has the ability to unify data from IT operations with business processes, allowing companies to fully and accurately answer the question, “Are we OK?”
7. Methodology
*Vendors marked with an asterisk did not participate in our research process for the Radar report, and their capsules and scoring were compiled via desk research.
For more information about our research process for Key Criteria and Radar reports, please visit our Methodology.
8. About Ron Williams
Ron Williams is an astute technology leader with more than 30 years’ experience providing innovative solutions for high-growth organizations. He is a highly analytical and accomplished professional who has directed the design and implementation of solutions across diverse sectors. Ron has a proven history of excellence propelling organizational success by establishing and executing strategic initiatives that optimize performance. He has demonstrated expertise in planning and implementing solutions for enterprises and business applications, developing key architectural components, performing risk analysis, and leading all phases of projects from initialization to completion. He has been recognized for promoting effective governance and positive change that improved operational efficiency, revenues, and cost savings. As an elite communicator and design architect, Ron has transformed strategic ideas into reality through close coordination with engineering teams, stakeholders, and C-level executives.
Ron has worked for the US Department of Defense (Star Wars initiative), NASA, Mary Kay Cosmetics, Texas Instruments, Sprint, TopGolf, and American Airlines, and participated in international consulting in Qatar, Brazil, and the U.K. He has led remote software and infrastructure teams in India, China, and Ghana.
Ron is a pioneer in enterprise architecture who improved response and resolution of enterprise-wide problems by deploying “smart” tools and platforms. In his current role as an analyst, Ron provides innovative technology and strategy solutions in both enterprise and SMB settings. He is currently using his expertise to analyze the IT processes of the future with particular interest in how machine learning and artificial intelligence can improve IT operations.
9. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.
10. Copyright
© Knowingly, Inc. 2023. "GigaOm Radar for AIOps" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.