This GigaOm Research Reprint Expires Mar 10, 2024

GigaOm Radar for Cloud Observability v3.02

1. Summary

Monitoring and observability are crucial IT functions that help organizations keep systems up and running and performance levels high. Knowing that a problem is likely to occur means it can be rectified before it impacts systems. Monitoring reports on performance and identifies “break/fix” conditions of failure and remedy. Observability seeks to determine why devices, systems, or applications within an IT environment are behaving the way they are. Observability also has a role to play in monitoring end-user experiences—knowing there is an issue may explain why customer journeys are being terminated at a particular point.

Operational awareness brings together all the various information streams within an organization, including IT data, to determine how systems are performing, what is likely to break next, and how to prevent problems before they have a chance to impact performance or availability.

Maintaining operational awareness is difficult enough in a single-cloud environment. When multiple cloud vendors are involved, it becomes exponentially more complicated. Most organizations’ cloud deployments are not limited to a single public cloud but include private clouds hosted on-site and hybrids of public and private clouds as well. Unlike on-premises infrastructure, cloud deployments can change quickly based on application needs, finances, and performance. Therefore, maintaining operational awareness has never been more important or more complex. This is where cloud observability comes in.

Cloud observability centers around having deep visibility into the systems, applications, and services in your distributed cloud environment. Cloud observability solutions provide performance monitoring, reporting, predictive analytics, and cost analysis; all of these are capabilities that contribute to IT and operational awareness.

This GigaOm Radar report highlights key cloud observability vendors and equips IT decision-makers with the information needed to select the best fit for their business and use case requirements. In the corresponding GigaOm report “Key Criteria for Evaluating Cloud Observability Solutions,” we describe in more detail the key features and metrics that are used to evaluate vendors in this market.

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

2. Market Categories and Deployment Types

To better understand the market and vendor positioning (Table 1), we assess how well cloud observability solutions are positioned to serve specific market segments and deployment models.

For this report, we recognize the following market segments:

  • Small-to-medium business (SMB): In this category, we assess solutions on their ability to meet the needs of organizations ranging from small businesses to medium-sized companies. Also assessed are departmental use cases in large enterprises, where ease of use and deployment are more important than extensive management functionality, data mobility, and feature set.
  • Large enterprise: Here offerings are assessed on their ability to support large and business-critical projects. Optimal solutions in this category have a strong focus on flexibility, performance, data services, and features to improve security and data protection. Scalability is another big differentiator, as is the ability to deploy the same service in different environments.

In addition, we recognize three deployment models for solutions in this report:

  • Software as a service (SaaS): These solutions are available only in the cloud. Often designed, deployed, and managed by the service provider, they are available only from that specific provider. The big advantage of this type of solution is the integration with other services offered by the cloud service provider (functions, for example) and its resulting simplicity.
  • On-site: These solutions are deployed on customer-owned infrastructure and managed by the enterprise.
  • Hybrid and multicloud solutions: These solutions are meant to be installed both on-premises and in the cloud, allowing organizations to build hybrid or multicloud infrastructures. Integration with a single cloud provider could be limited compared to the other options, and these solutions can be more complex to deploy and manage. On the other hand, this approach is more flexible, and the user usually has more control over the entire stack regarding resource allocation and tuning.

Table 1. Vendor Positioning

Market Segment

Deployment Model

SMB | Enterprise | SaaS | On-Site | Hybrid
Broadcom
Chronosphere
Cisco AppDynamics
Datadog
Dynatrace
Elastic
Grafana Labs
Honeycomb
IBM Instana
LogicMonitor
Logz
Microsoft
NetApp
New Relic
OpenText (Micro Focus)
SolarWinds
Splunk
StackState
Sumo Logic
VMware
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

3. Key Criteria Comparison

Building on the findings from the GigaOm report, “Key Criteria for Evaluating Cloud Observability Solutions,” Tables 2, 3, and 4 summarize how each vendor included in this research performs in the areas we consider differentiating and critical in this sector.

  • Key criteria differentiate solutions based on features and capabilities, outlining the primary criteria to be considered when evaluating a cloud observability solution.
  • Evaluation metrics provide insight into each product’s top-line characteristics—the traits that define the impact the solution will have on the organization.
  • Emerging technologies and trends identify the most compelling and potentially impactful technologies and trends emerging over the next 12 to 18 months with the potential to either disrupt or differentiate the solution and provider landscape.

The objective is to give the reader a snapshot of the technical capabilities of available solutions, define the perimeter of the market landscape, and gauge the potential impact on the business.

Note: The key criteria and evaluation metrics we reviewed as part of this Radar report are largely the same as used in our 2022 Radar report. However, while vendors that were lacking capabilities last year have largely caught up, most that were positioned as Leaders last year have not advanced their solutions at a similar rate. This shows that we’re at a crossroads in terms of capabilities and that there are fewer differentiators among vendors this year.

Therefore, we approached our review differently, scoring vendors more strictly than in our last report to better highlight differences among offerings. As a result, it is more difficult for a vendor to score a 3 this year for key criteria, evaluation metrics, and emerging technologies.

However, vendor roadmaps are promising and indicate the market is on the cusp of change. It’s likely that features currently regarded as emerging technologies will soon become more prevalent, and the current key criteria (differentiators), in turn, are likely to become table stakes (required functionality). We anticipate a new set of key criteria will emerge in the next Radar report.

Table 2. Key Criteria Comparison

Key Criteria

Vendor | Dashboards & Reports | Inventory | User Interaction Performance Monitoring | Multicloud Resource View | Predictive Analysis | Microservices Detection | Support for Multiple Public Clouds
Broadcom | 2 | 3 | 2 | 3 | 2 | 2 | 2
Chronosphere | 2 | 2 | 1 | 2 | 1 | 2 | 1
Cisco AppDynamics | 2 | 3 | 2 | 3 | 2 | 2 | 2
Datadog | 2 | 2 | 2 | 3 | 2 | 2 | 2
Dynatrace | 2 | 2 | 3 | 3 | 2 | 2 | 2
Elastic | 2 | 2 | 2 | 2 | 2 | 2 | 2
Grafana Labs | 2 | 2 | 1 | 2 | 2 | 2 | 2
Honeycomb | 2 | 2 | 1 | 2 | 3 | 2 | 2
IBM Instana | 2 | 2 | 2 | 2 | 2 | 2 | 2
LogicMonitor | 2 | 1 | 2 | 2 | 2 | 2 | 2
Logz | 2 | 2 | 1 | 2 | 1 | 1 | 2
Microsoft | 2 | 2 | 2 | 2 | 2 | 2 | 2
NetApp | 2 | 2 | 1 | 2 | 3 | 2 | 2
New Relic | 3 | 2 | 2 | 3 | 2 | 2 | 2
OpenText (Micro Focus) | 2 | 2 | 2 | 2 | 2 | 2 | 2
SolarWinds | 2 | 3 | 2 | 1 | 2 | 2 | 2
Splunk | 2 | 2 | 2 | 3 | 2 | 2 | 2
StackState | 2 | 2 | 1 | 1 | 2 | 2 | 2
Sumo Logic | 2 | 2 | 2 | 3 | 2 | 2 | 2
VMware | 2 | 2 | 2 | 2 | 2 | 2 | 2
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

Table 3. Evaluation Metrics Comparison

Evaluation Metrics

Vendor | Deployment Ease | Flexibility | Ease of Use | Security
Broadcom | 2 | 3 | 1 | 2
Chronosphere | 2 | 2 | 1 | 2
Cisco AppDynamics | 2 | 2 | 2 | 2
Datadog | 2 | 3 | 2 | 2
Dynatrace | 3 | 3 | 1 | 2
Elastic | 2 | 2 | 2 | 2
Grafana Labs | 2 | 2 | 2 | 2
Honeycomb | 2 | 2 | 1 | 2
IBM Instana | 2 | 2 | 2 | 2
LogicMonitor | 2 | 2 | 2 | 2
Logz | 1 | 2 | 2 | 2
Microsoft | 2 | 2 | 2 | 2
NetApp | 2 | 1 | 2 | 3
New Relic | 2 | 2 | 2 | 3
OpenText (Micro Focus) | 2 | 2 | 2 | 2
SolarWinds | 2 | 2 | 2 | 2
Splunk | 2 | 2 | 2 | 2
StackState | 2 | 1 | 3 | 2
Sumo Logic | 2 | 2 | 2 | 2
VMware | 2 | 2 | 2 | 2
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

Table 4. Emerging Technologies Comparison

Emerging Tech

Distributed AI/ML | Identification of Shadow Changes
Broadcom
Chronosphere
Cisco AppDynamics
Datadog
Dynatrace
Elastic
Grafana Labs
Honeycomb
IBM Instana
LogicMonitor
Logz
Microsoft
NetApp
New Relic
OpenText (Micro Focus)
SolarWinds
Splunk
StackState
Sumo Logic
VMware
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

By combining the information provided in the tables above, the reader can develop a clear understanding of the technical solutions available in the market.

4. GigaOm Radar

This report synthesizes the analysis of key criteria and their impact on evaluation metrics to inform the GigaOm Radar graphic in Figure 1. The resulting chart is a forward-looking perspective on all the vendors in this report, based on their products’ technical capabilities and feature sets.

The GigaOm Radar plots vendor solutions across a series of concentric rings, with those set closer to the center judged to be of higher overall value. The chart characterizes each vendor on two axes—balancing Maturity versus Innovation, and Feature Play versus Platform Play—while providing an arrow that projects each solution’s evolution over the coming 12 to 18 months.

Figure 1. GigaOm Radar for Cloud Observability

As you can see in the Radar chart in Figure 1, the cloud observability market is highly competitive, and this year has seen a lot of focus by vendors on closing gaps in functionality they are offering versus their peers.

In the Maturity/Platform Play quadrant, we have a large number of vendors in a tight grouping in the Leaders circle. These vendors have cohesive, well-developed offerings with strong functionality that can be applied to a wide range of cloud services. Over the next 12 to 24 months, this group is likely to further consolidate its offerings and mature existing capabilities rather than focus on enhancements and innovation. The choice for many users will therefore come down to subtleties in how vendors deliver this functionality to suit an organization’s size and specific requirements, existing relationships, wider vendor offerings beyond observability, and the potential for tool displacement. In other words, while Leaders like Dynatrace are well suited to a variety of organizations, they won’t be the best fit for every use case. For example, a non-technical SMB would likely be better served by a different solution.

In the Innovation/Platform Play quadrant, we have seen notable advancement in vendor capabilities over the last year, and we expect this trend to continue with particular focus on areas like unified instrumentation, container-based networks, and analytics options. The likes of New Relic and NetApp are set to continue advancing their roadmaps in these areas over the next 12 to 24 months, and we expect emerging technologies to look significantly different in our next Radar report as a result of these and other vendors pushing the market forward.

In the Maturity/Feature Play quadrant are strong solutions from open source vendors and cloud providers that are highly capable, if slightly less complete, at present.

The Innovation/Feature Play quadrant houses a handful of vendors in the Challengers circle that are pushing hard into the observability space. These are either new vendors that have recently come to market or established vendors that have expanded their product offerings to include cloud observability. Given the strong competition in this space, we expect significant innovation over the next 12 to 24 months as these vendors build out their solutions and establish themselves as credible alternatives to the incumbents.

It’s important to note that there’s no “bad quadrant” in Figure 1. Leaders and Challengers alike may be the best fit for your organization based on your specific use case and business requirements. In an established market such as this, purchase decisions will largely depend on the type of licensing available, whether professional services are needed and available for deployment and implementation, the number of existing tools displaced, and the estimated time to value.

Inside the GigaOm Radar

The GigaOm Radar weighs each vendor’s execution, roadmap, and ability to innovate to plot solutions along two axes, each set as opposing pairs. On the Y axis, Maturity recognizes solution stability, strength of ecosystem, and a conservative stance, while Innovation highlights technical innovation and a more aggressive approach. On the X axis, Feature Play connotes a narrow focus on niche or cutting-edge functionality, while Platform Play displays a broader platform focus and commitment to a comprehensive feature set.

The closer to center a solution sits, the better its execution and value, with top performers occupying the inner Leaders circle. The centermost circle is almost always empty, reserved for highly mature and consolidated markets that lack space for further innovation.

The GigaOm Radar offers a forward-looking assessment, plotting the current and projected position of each solution over a 12- to 18-month window. Arrows indicate travel based on strategy and pace of innovation, with vendors designated as Forward Movers, Fast Movers, or Outperformers based on their rate of progression.

Note that the Radar excludes vendor market share as a metric. The focus is on forward-looking analysis that emphasizes the value of innovation and differentiation over incumbent market position.

5. Vendor Insights

Broadcom

Broadcom Inc. is a global infrastructure technology company with 50 years of experience. Broadcom’s AIOps and Observability platform provides full-stack observability of the digital experience (including mobile and web applications) and monitors cloud-native architectures, hybrid infrastructures, and network services. The platform has an integrated analytics engine that can interpret cross-domain monitoring data, including metrics, alarms, inventory, topology, and relationships. Licensing is based on a normalized “devices” metric tied to the number of monitored entities across the stack. Multiple deployment options are supported: the Broadcom SaaS offering hosted on the Google Cloud GKE platform, a traditional on-premises deployment, an on-premises cloud-native deployment supporting Kubernetes or OpenShift, or a hybrid model.

The AIOps platform comprises several products including DX Application Performance Management (APM) and DX Operational Intelligence.

  • DX APM is an enterprise-scale monitoring and insights solution that provides observability across the application landscape.
  • DX Operational Intelligence enhances the APM capabilities natively for operational use cases, AIOps, and observability scenarios. It combines DX APM with App Synthetic Monitoring (ASM) and DX Dashboards (based on Grafana), which provide dashboards out of the box.

End-user transaction monitoring is supported in a number of ways. App Experience Analytics supports full real user monitoring (RUM) for HTTP and mobile requests, and it provides full-session replay and funnel analysis capabilities. Synthetic transaction monitoring is achieved through the script- and recording-based App Synthetic Monitor (ASM), using 90 monitoring stations worldwide and the full Google Analytics KPI set, combined with geographic location capabilities integrated into transaction correlation and system health analysis. All transactions are individually tracked end to end and correlated from mobile to mainframe. On-premises monitoring stations work in the same way and enable walled-garden security scenarios through trusted HTTPS connections to SaaS.

Broadcom’s inventory capabilities are extensive, with the ability to monitor mainframe to public cloud, although additional products may be necessary to access mainframe data. The Zero Touch Unified Monitoring Agent auto-discovers cloud, microservices, and container environments. Dedicated platform agents are used in non-containerized or cloud environments to perform a number of tasks, including discovering infrastructure components and reporting on all minimum metrics. End-to-end health and performance insights are available to provide granular analysis of the applications, infrastructure, and network components. Unique entities are correlated to show network and infrastructure performance data and analysis in a single view.

Organizations with smaller monitoring environments that are restricted to on-premises installs and have no shared cluster environments may find the hardware requirements for an enterprise-scale orchestrated, microservices-based, natively containerized platform a high barrier to entry. Documentation is also difficult to find.

Strengths: Broadcom has outstanding inventory capabilities and supports all environments from mainframe to cloud, enabling mainframe organizations to use a single application to monitor their entire infrastructure. It is also strong across the rest of the key criteria, including user interaction performance monitoring, with capabilities that enable IT to monitor the performance of applications to ensure that they run optimally and do not impact end users.

Challenges: The Broadcom AIOps and Observability solution is difficult to research. The marketing and documentation are fragmented. Organizations that wish to use the solution with mainframe environments may need to purchase additional applications.

Chronosphere

Chronosphere’s origins go back to 2014 when founders Martin Mao and Rob Skilington were running the monitoring team at Uber. They built an in-house solution for metrics using open source tools, then added logs and traces for Uber’s cloud-native environment. In 2019, they decided to commercialize it as a SaaS solution and created Chronosphere. Currently, the solution is suited to both SMBs and large enterprises.

Chronosphere uses multiregion replication and a single-tenant architecture, and it can scale to billions of metrics. Chronosphere ingests metrics generated by containerized infrastructure, microservices, applications, and business services, using Prometheus or OpenTelemetry and queries those metrics with PromQL. Teams receive alerts and are given the context they require to address incidents. From there, they can drill into distributed trace data to understand the root cause of an incident. The Chronosphere Control Plane allows organizations to decide what data should be retained and for how long, and the company’s revenue model means that organizations pay only for the data that is retained. The solution does not ingest logs, but log data from other applications can be viewed.

A dynamic topology map is available showing the services that are relevant to a particular user, enabling them to drill down through different levels to look at the elements that are important to them (such as a service operation, errors, or requests) and to understand the dependencies based on what is happening in real time.

The solution is optimized for engineers and provides them with the information they require in the right context; they typically receive a notification of an issue, triage the issue, and discover the root cause.

Chronosphere Collector is used to ingest metrics with one Collector per node. The Collector adheres to open source standards and can be integrated with PagerDuty, Slack, OpsGenie, and webhooks.

Users are able to import Prometheus AlertManager alerts and Grafana dashboards, with full support for PromQL and Graphite. Automatically generated dashboards are available out of the box, but users can also create or customize dashboards and drill down into dashboards for further information. When Chronosphere detects a slow query that has perhaps been written inefficiently and could be made faster with aggregation, Chronosphere creates a pre-aggregated time series of that data point, which will be much faster every time it is queried. Chronosphere then looks for the same query across all of the dashboards and swaps the existing queries with the faster pre-aggregated one.
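
The pre-aggregation behavior described above can be illustrated with a short sketch. This is not Chronosphere's implementation—just a minimal stand-in showing why collapsing high-cardinality per-pod series into one per-service series makes every later dashboard query cheaper (the open source analogue is a Prometheus recording rule such as `sum without (pod) (http_requests_total)`):

```python
from collections import defaultdict

# Raw samples: (metric name, labels, value) — one high-cardinality series per pod.
raw_samples = [
    ("http_requests_total", {"service": "checkout", "pod": "checkout-1"}, 120),
    ("http_requests_total", {"service": "checkout", "pod": "checkout-2"}, 80),
    ("http_requests_total", {"service": "search", "pod": "search-1"}, 50),
]

def pre_aggregate(samples, drop_label):
    """Sum samples across `drop_label`, emitting one lower-cardinality
    series per remaining label set. Dashboards can then read the small
    pre-aggregated series instead of scanning every raw series."""
    out = defaultdict(float)
    for metric, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        out[(metric, kept)] += value
    return dict(out)

# Three per-pod series collapse into two per-service series.
aggregated = pre_aggregate(raw_samples, drop_label="pod")
```

Swapping dashboard queries to read `aggregated` rather than `raw_samples` is the essence of the query-rewriting step the paragraph describes.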

Strengths: Chronosphere has a number of good capabilities, including dashboards and reporting, microservices discovery, multicloud resource views, and inventory.

Challenges: Chronosphere is weak in the areas of user interaction performance monitoring and predictive analysis, for which there are no artificial intelligence and machine learning (AI/ML) capabilities.

Cisco AppDynamics

Cisco moved into the observability market with its acquisition of AppDynamics and ThousandEyes (which provide comprehensive digital experience monitoring), and the subsequent integration of Cisco’s Intersight cloud operations platform. The AppDynamics solution is geared toward mid-sized to large enterprises and appeals to financial, retail, and IT service customers. The company has transitioned AppDynamics from an APM solution to an observability platform by adding cloud, network, and infrastructure monitoring capabilities. The solution can visualize revenue paths and correlate customer and application experience to find and fix application issues. It can also monitor errors using its cognition engine, isolate problematic domains, and identify root causes from snapshot data by scanning all instances of collected telemetry in the dependency tree using the automated transaction diagnostic feature.

Cisco AppDynamics provides a mature monitoring and observability solution for traditional, hybrid, and cloud-native applications, available as an on-premises software solution. Using both proprietary agents and OpenTelemetry, it provides visibility across the application, infrastructure, network, and security stacks. Integration with multiple Cisco offerings enables full-stack observability. What differentiates this solution are the capabilities that are provided to consume data from on-premises network devices, which are superior to those of many competitors.

Cisco AppDynamics Cloud is a cloud-native offering that supports next-generation modern applications with increased scale, automation, and ease of use. Data collection is based on OpenTelemetry, with Cisco AppDynamics agents publishing to standard OpenTelemetry collectors, which allows the performance of application code, runtime, and behavior to be monitored.
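
As a rough illustration of the standard-collector pattern, a generic OpenTelemetry Collector pipeline looks like the following. This is vendor-neutral OpenTelemetry configuration, not an AppDynamics-specific sample, and the endpoint values are placeholders:

```yaml
# Minimal OpenTelemetry Collector pipeline (illustrative).
# Agents push OTLP data to the receiver; the collector batches it
# and forwards it to whatever backend the vendor supplies.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlphttp:
    endpoint: https://backend.example.com/otlp  # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```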

Both Cisco AppDynamics and Cisco AppDynamics Cloud are single-product solutions. However, they can be enhanced with Cisco Secure Application for AppDynamics to provide runtime-based vulnerability and attack detection and protection via proactive policy-based remediation, Cisco ThousandEyes for network, internet, and endpoint performance monitoring, and Cisco Intersight Workload Optimizer for infrastructure monitoring and cost/performance optimization.

Business transactions, both known and unknown, are automatically discovered to create a detailed topology map of traffic flow within an application, displaying real-time user behavior. Traffic patterns are monitored and baselines of acceptable performance established. When important transactions slow down, diagnostic actions are automatically triggered to identify the root cause of any problem.

There are various sales models available: Infrastructure Monitoring Edition provides foundational infrastructure diagnostics; Premium Edition adds complete back-end monitoring; Enterprise Edition provides back-end and business performance monitoring; Enterprise Edition for SAP solutions provides code-level SAP visibility and performance monitoring; and the RUM component provides full visibility into the digital experience.

Strengths: AppDynamics has outstanding capabilities in inventory, for which its ability to consume data from on-premises network devices is enhanced by being part of Cisco, with the broad networking capabilities that it provides. In addition, its ability to monitor real-time user behavior means that issues can be speedily identified and resolved.

Challenges: There is some weakness in the area of predictive analysis, where strengthening the AI/ML capabilities would provide better operational awareness.

Datadog

Datadog is headquartered in New York City, with regional headquarters in Boston, Dublin, Paris, Singapore, Sydney, and Tokyo, and offices across the US, Europe, and Asia Pacific. Founded in 2010, Datadog’s goal is to eliminate friction between developers and system administrators. Its growth is driven by a focus on automation and real-time observability. Launched as an infrastructure monitoring company, Datadog has expanded its portfolio via both acquisition and organic growth to offer solutions throughout the full observability space.

Datadog’s SaaS-based observability and security solution is a single platform for metrics, traces, logs, events, and security signals from across the stack, automatically enriched with contextual metadata. It includes application performance monitoring, infrastructure monitoring, log management, digital experience monitoring, network monitoring, and security. These products—over twenty individual modules—are tightly integrated and share several cross-platform features such as dashboards, alerts, service-level objectives (SLOs), incident management, notebooks, and proactive and contextual ML capabilities.

Datadog provides a number of dashboards, including time-series, histograms, heatmaps, funnels, and service maps. Users can also customize dashboards, either manually or via an application programming interface (API) or Terraform, and can extend existing visualizations using Datadog’s App Platform or create their own dashboards. More than 600 integrations are included, each with its own out-of-the-box dashboards. PowerPacks leverage common patterns and internal expertise for new dashboards. Users can drill down through the metrics displayed to view individual events and entities. All dashboards can be shared and can be exported on a scheduled or ad hoc basis. Notebook- and widget-sharing capabilities allow graphs to be delivered within chat tools such as Slack and Microsoft Teams.
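
Programmatic dashboard creation of the kind described above can be sketched as follows. The payload shape follows Datadog's public v1 dashboard API as we understand it; field names and the query string are illustrative assumptions and should be verified against current Datadog API documentation before use:

```python
import json
import urllib.request

def build_dashboard_payload(title: str, query: str) -> dict:
    # Assumed v1 dashboard schema: title, layout_type, and a widget list.
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": [
            {
                "definition": {
                    "type": "timeseries",
                    "requests": [{"q": query}],
                }
            }
        ],
    }

def create_dashboard(api_key: str, app_key: str, payload: dict):
    # Network call sketch — not executed here; requires valid keys.
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/dashboard",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
    )
    return urllib.request.urlopen(req)

payload = build_dashboard_payload(
    "Service latency", "avg:trace.http.request.duration{*}"  # illustrative query
)
```

The same payload could equally be expressed as a Terraform `datadog_dashboard` resource; the API route is shown here because it needs no provider setup.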

Datadog uses an agent to find and discover cloud-based infrastructure (in virtual, containerized, and serverless environments across all major cloud providers) and all types of application architectures (microservices, monolithic, and async systems). Mapping tools such as infrastructure and service maps can be used to provide a graphical representation of cloud fleets, and the network performance monitoring tool, which has an adaptive map, can be used to keep track of network performance across all detected resources.

The user interaction performance monitoring capabilities include RUM for web, mobile, and smart TV applications. Both RUM and synthetic tests are available and users can choose which business transactions to monitor.

Microservices detection can leverage a wide array of data, with Datadog’s Distributed Tracing solution working with its native libraries as well as with open standards like OpenTracing and OpenTelemetry. Additionally, a universal service monitoring tool leverages eBPF-based methods to detect running services and observe their performance. Users can also integrate with their configuration management database (CMDB) or manage service metadata via GitOps using Datadog’s Service Catalog, allowing users to register services that are not monitored by any tool.

Licensing is subscription-based, per host, by volume, and can be tailored to individual customers. However, the number of products necessary for a solution can make controlling costs difficult.

Strengths: Datadog has good capabilities across all key criteria areas, and can provide visibility across all infrastructures, allowing users to identify issues rapidly. The user interaction performance monitoring capabilities enable IT to assess how well end-user needs are being addressed.

Challenges: Datadog has a strong background supporting SMBs and is returning to large enterprises at the behest of its customers. With the large number of modules and the highly flexible licensing model, it can be difficult to determine costs.

Dynatrace

Dynatrace is a global company that was founded in Linz, Austria, and is headquartered in Waltham, MA. Its solution combines observability, AIOps, and application security in a single platform, and it has a solid reputation for high-quality APM. Dynatrace is now building on that reputation with its full observability platform based on Davis AI, the company’s proprietary AI engine. The product is targeted at the 15,000 largest global companies.

The system is designed to scale and operate in hybrid clouds, public clouds, or edge environments, as well as on-premises. The primary solution is a SaaS offering with agents deployed on-site and in other environments, but Dynatrace can also be deployed at the edge on customer-provisioned infrastructure, an option the company calls Dynatrace Managed.

The Dynatrace platform includes APM, AIOps, infrastructure monitoring, digital business analytics, digital experience management (DEM), application security, and automation capabilities for enterprise IT departments and digital businesses. Using automation in concert with the Davis AI engine, Dynatrace provides root cause details of application performance, generates insights into the underlying infrastructure, and presents an overview of the user experience. It provides automatic instrumentation with OneAgent as well as open data ingestion (for example, via Open Telemetry).

For inventory, Dynatrace discovers and monitors cloud-based infrastructure—including virtualization, containers, and servers—across multicloud and hybrid environments, combining agentless monitoring with its OneAgent approach. Auto-discovery of cloud services, container environments and Kubernetes, virtual machines (VMs), microservices, serverless functions, and hosts allows data to be ingested for cloud services, Kubernetes, SNMP, WMI, OpenTelemetry, AWS Distro, StatsD, Telegraf, Prometheus, and more. Dynatrace’s Davis AI engine provides answers from data that is contextualized in the unified data model Smartscape, which maps out all components along with their dependencies. Dynatrace provides support for more than 620 technologies out of the box.

In November 2022, Dynatrace announced Grail, its new causational data lakehouse with a massively parallel processing (MPP) analytics engine. The technology enables log and event data management and analytics without storage tiering and indexing efforts. Grail unifies observability data, as well as security and business data, from cloud-native and multicloud environments for AI-powered answers and automation.

Dynatrace automatically detects microservices; once detected, they are enriched for trace analysis, and each business process and user session is monitored. OneAgent and OpenTelemetry data are combined using PurePath distributed tracing and code-level analysis technology, which enriches the data with relevant context (such as user experience data) for end-to-end distributed tracing, code-level insights, continuous profiling, and metrics and log analysis. Relationships between microservices and application components are automatically and dynamically detected, providing transactional insight into the contribution of each component of a request for business flow analysis. From this information, a continuously updated service map is created that includes services across clouds and is used for analysis. Smartscape automatically maps and visualizes interactions and relationships between applications and connects services to the underlying infrastructure, updating in real time as microservices are added or disappear. Davis uses this information for auto-baselining, anomaly detection, automatic root cause analysis, and service optimization.

Learning to use Dynatrace may be difficult for smaller, less-technical organizations. Even for larger organizations, it may take users time to be completely comfortable with the solution. However, once learned, the solution is easy to use.

Dynatrace is licensed on a subscription basis, providing usage-based pricing across all modules, so costs vary greatly depending on the kind of coverage you need. The range of options provides flexibility, but it may make financial planning and cost management more difficult.

Strengths: Dynatrace is strong across all key criteria, in particular inventory, which allows the solution to locate and log all cloud-based infrastructure inventory, and microservices detection, which enables it to find microservices, including service buses and data services, without human intervention such as manual tagging (though manual tagging is also supported). The use of a single agent (OneAgent) allows for quick deployment.

Challenges: Dynatrace has a steep startup curve and may not be suitable for non-technical SMBs. Improvements in packaging and pricing would deliver increased flexibility for dynamic environments. Dynatrace has introduced Dynatrace Platform Subscription (DPS) to improve in this area.

Elastic

Elastic has a solid observability platform using the free and open Elastic Search Platform. The company has successfully layered usability and visibility on top of the stack, and its technology is used widely across enterprises of every size and vertical.

Elastic supports SaaS-based public cloud, private cloud, and hybrid deployments, offering both on-site and cloud (AWS, Azure, and GCP) versions. This enables users to create standalone, hybrid cloud, or multicloud variations of the solution as needed. This flexibility is particularly useful when an enterprise needs to start at one location (either on-premises or in the cloud) and quickly expand to other locations without creating siloed implementations or fragmenting the toolset.

Elastic provides visibility across cloud-native infrastructure and applications, including services, hosts, containers, Kubernetes pods, and serverless tiers. Anomalies can be identified and service dependencies mapped. More than 300 out-of-the-box integrations are provided for common services and platforms, and an intuitive UI provides visibility into all infrastructure and applications, from microservices to serverless architectures, to identify the root causes of issues.

Elastic uses agentless data ingestion leveraging native integrations within the cloud console, allowing users to import a variety of data such as logs, metrics, traces, and content from their own ecosystem, including applications, endpoints, infrastructure, cloud, network, and workplace tools.

Kibana provides ad hoc visualizations and analysis with support for dimensions, tags, cardinality, and fields. Attributes, hostname, IP address, and tags can all be used as search criteria. Real-time threat monitoring and executive summaries showing KPIs are just two of the types of dashboards that can be viewed, and all dashboards are interactive, allowing drill-downs. Data is displayed in panels as charts, tables, maps, and in a variety of other formats, enabling users to compare data side by side to identify patterns and connections. Several types of panels are supported, and a number of options are available for creating panels. Vega and Vega-Lite allow users to create custom visualizations without the need for JavaScript.

Users are able to collect, measure, and analyze performance data by URL, operating system, browser, and location to assess how well applications and infrastructure are performing on end-user systems. Metrics can be viewed by individual pages or groups of pages. Both RUM and synthetic monitoring are supported. End-user experience can be simulated across multiple-step journeys with the ability to capture a detailed breakdown of status information and errors.

Strengths: Elastic has good capabilities across all key criteria, with extensive dashboard features through Kibana that provide visibility into all metrics, allowing users to quickly identify and rectify anomalies. User interaction performance monitoring provides early warning of issues that are impacting user experience and could result in lost opportunities.

Challenges: Elastic is based on open source code, and although Elastic Cloud can be deployed as a packaged solution, shops without strong in-house technical expertise may still find implementation challenging.

Grafana Labs

Grafana Labs provides a unique approach to help customers on their observability journey by offering plug-ins to all major observability solutions (Datadog, New Relic, Dynatrace, and so forth), enabling customers to view all their observability data together in a single Grafana dashboard. In 2021, the company acquired performance-testing vendor k6, which extended its capabilities. For example, Grafana Cloud can now be used by developers to perform load testing on test scripts, and can visualize the results to troubleshoot issues and identify root causes.

Grafana Cloud is a composable observability platform, integrating metrics, traces, and logs with Grafana visualization. It takes advantage of open source observability software, including Prometheus, Loki, and Tempo, with no requirement to install, maintain, and scale the observability stack. According to Grafana Labs, getting up and running is extremely quick: users simply select the services to be monitored and install the Prometheus-inspired agent to receive preconfigured alerts and dashboards. Grafana Cloud provides a fully managed service, and it includes a scalable, managed back end for metrics, logs, and traces.

Grafana Labs is a leading contributor to key observability open source projects such as Loki, Grafana, Tempo, and Mimir. The solution is API-driven and is also able to access data from commercial platforms such as Datadog, Dynatrace, MongoDB, Oracle, and SAP.

The Grafana Enterprise Stack includes Grafana, which unifies data sets across a company and allows DevOps and site reliability teams to view holistic business, application, and infrastructure health. It is built on a plug-in architecture with hundreds of connectors to popular tools, allowing users to interact with the underlying data sources in real-time without the need to create duplicate copies. It also includes solutions for preventing, finding, and acting on issues that could impact a company’s commitments.

Grafana Labs unifies existing data wherever it lives and allows it to be displayed in a single dashboard via APIs. Its "first pane of glass" approach allows users to visualize, alert on, and correlate data. Dynamic dashboards can be created and shared with other users, enabling collaboration. Advanced querying and transformation capabilities are available to customize panels and create visualizations. Users can go deeper with component-level views or a deep link into the data source.

Grafana is less capable in the area of user interaction performance monitoring, where it has limited synthetics and only recently announced RUM capabilities. Once collected, front-end telemetry can be correlated with back-end and infrastructure data for full-stack observability.

Grafana Labs offers both self-hosted and cloud-based options for its observability solutions. Licensing for Grafana Labs is flexible and depends on the number of active users, the services chosen, and so forth. However, the range of options available can make costs difficult to predict. Its roadmap includes several cost-saving features due for release in 2023.

Strengths: Grafana Labs has good capabilities across most key criteria. A feature that differentiates it is its ability to provide visualizations on a single screen from multiple third-party applications organizations are using to gather data, without having to import that data.

Challenges: Grafana Labs is currently weak in terms of user interaction performance monitoring, as it has limited synthetics capabilities. Although these gaps are due to be addressed, users are currently unable to adequately track end-user performance and identify issues that may be resulting in abandoned sales.

Honeycomb

Honeycomb is headquartered in California, US. It was founded in 2016 by software developers and infrastructure engineers to help engineering, DevOps, and site reliability teams operate more efficiently by understanding their complex and distributed systems.

The Honeycomb SaaS solution supports OpenTelemetry and ingests metrics, traces, and logs via the OpenTelemetry protocol (OTLP) or HTTP. Licensing is based on the number of events ingested. Honeycomb defines an event as a line of data sent as a JSON object posted to Honeycomb's API. The data can capture anything in a system worth tracking, such as a trace span or structured log, along with all the context fields the application is instrumented to generate. A single event contains fields with unlimited cardinality and can include thousands of dimensions of context. In general terms, each user interaction is regarded as an event, so the ability to drill down to the individual user experience is differentiated. All plans include an unlimited number of users, an unlimited number of services, unlimited data storage, and 60 days of data retention.

The company believes it looks at ingested data differently from competitors. Instead of an analysis loop linked to a particular framework or type of data, it treats data agnostically. The solution includes a data store that Honeycomb built from the ground up to handle high-dimensionality data, and the vendor recommends that users send as much data as possible, with as many dimensions as possible, so it can be correlated with other data.

This approach benefits Honeycomb's BubbleUp feature, which uses machine analysis to automatically detect commonalities in outlier data. Users can debug problems that impact end-user experience, such as latency, traffic anomalies, error rates, or availability issues. BubbleUp allows users to highlight anomalies in their data by comparing a user-specified subset of events against a representative baseline across thousands of dimensions at any cardinality. It then surfaces the dimensions with the greatest differences as likely avenues for investigating the cause of issues creating problems for end users. By identifying performance bottlenecks and exposing common patterns in the data, users have the context to decide what is important, and then triage, identify, and resolve the issue.

Another feature that benefits from Honeycomb’s event-based columnar data store is its SLOs. Honeycomb SLOs express reliability goals over a rolling time window, giving teams an “error budget” to contextualize and prioritize issues. SLOs measure health by evaluating each event consumed by Honeycomb against a user-defined service-level indicator (SLI) to tie alerting directly to customer experiences. In addition to the current error budget burndown of a given SLO, Honeycomb users can drill into a detailed diagnostic view that provides that SLO’s historical performance, a heatmap of requests and their respective success or failure, and BubbleUp results comparing dimensions between passing and failing requests as detailed above. Teams can click through from this view into Honeycomb’s query builder to further diagnose any anomalies or dig into corresponding traces. This feature combines with customizable dashboards, shared query history, and filterable service maps to provide context across distributed apps and teams.

Strengths: Honeycomb has good capabilities for many key criteria, including inventory, multicloud resource view, microservices detection, and support for multiple public clouds. The BubbleUp feature provides a unique method for predictive analysis.

Challenges: Honeycomb is weak in dashboards and reporting, with a setup that is not as simple as many others; dashboards are essentially queries against its multidimensional data store. User interaction performance monitoring is also weak due to a lack of RUM or synthetics capabilities.

IBM Instana

IBM has adopted a cloud strategy that it believes will make it a leader in the hybrid cloud space. It acquired Red Hat in 2019, bringing OpenShift into its stable, which, in turn, improved IBM's multicloud management platform. At the end of 2020, IBM acquired Instana, giving IBM an enterprise observability and application performance monitoring platform. Instana enhances IBM Cloud Pak for Watson AIOps, providing a continuous stream of information that improves the quality of the recommendations from its AI models.

Instana’s full-stack APM discovery and monitoring includes automatic discovery, monitoring, root cause analysis, and feedback. In-depth root cause analysis of every incident is provided, with all events correlated using the Dynamic Graph. This results in the generation of a single alert, which contains a cause-and-effect report that includes hyperlinks to the details. Stream processing is used to collect Instana’s Dynamic Graph records, and relationships among all entities are modeled in real time, providing insights into interdependencies and the ability to identify what is not running at any time.

Instana APM has an agent architecture with sensors, which are mini-agents (small programs that each monitor a single aspect) automatically managed by the single agent, which can be deployed either as a stand-alone process on the host or as a container via the container scheduler. The agent automatically detects physical components first, collects configuration data, and monitors it for changes. It also sends important metrics for each component every second.

Instana’s automatic discovery of microservices includes application services, platforms, and infrastructure, along with configuration and dependencies, with no requirement to change code or platform configuration. Real-time change detection of newly deployed and updated elements is also available. All service dependencies are automatically detected, mapped, and monitored for performance and issues, upstream and downstream. Automatic root cause analysis for microservices applications is included.

User interaction performance monitoring is supported through the ability to collect detailed browser performance data correlated to backend performance, with browser requests captured automatically. End-user experiences can be analyzed to identify performance issues in areas such as page views, resources, and HTTP requests.

Strengths: IBM has good capabilities across all key criteria, including microservices discovery, which helps organizations to optimize performance. Its user interaction performance monitoring allows organizations to optimize browser performance by identifying and resolving issues.

Challenges: Though an API interface does exist, Instana does not push information to other data sinks easily, and integration with other IBM tools is a work in progress. This makes it difficult to leverage the extensive library of solutions IBM can offer.

LogicMonitor

LogicMonitor provides an automated monitoring and observability platform targeted at enterprise IT and managed service providers (MSPs). It’s available in SaaS, on-premises, or hybrid deployment models.

LogicMonitor is built on OpenTelemetry and OpenMetrics. The single unified platform provides end-to-end tracing with code-level visibility across the entire stack, seamless data collaboration at scale, visibility into networks, clouds, containers, applications, servers, and log data, and AIOps for metrics, logs, and applications. Capabilities include APM, root cause analysis, anomaly detection, forecasting, alerting, and dynamic topology mapping (which allows dependencies and relationships to be visualized).

APM features include auto-instrumentation of client libraries for Java, Node.js, .NET, Go, and Python, and the ability to push or pull data from any source.

Agentless collectors automatically discover, map, and set baselines for complex and distributed infrastructure, with website monitoring and synthetics enabled. Coverage is provided for public (AWS, GCP, and Azure) and private clouds, multicloud environments, Kubernetes deployments, SaaS applications, and traditional environments. Advanced forecasting allows future trends to be predicted.

Dashboards and reports allow users to automatically map and visualize the relationships among microservices and application components to aid troubleshooting. Unknown impacts to application performance can be eliminated with public and private cloud and container monitoring without the need to write complex scripts. Dashboards are available for all aspects of the system—providing high-level KPIs for business insights down to granular technical metrics—covering infrastructure performance, application status, and centralized metrics for cloud, hybrid, and virtualized infrastructures. Prebuilt dashboards are provided, but dashboards can also be created or customized by users. Dashboards can be organized into groups; access can be shared, or it can be controlled through access rights and permissions.

AI/ML is used in the auto-detection of anomalies and to adjust thresholds dynamically with continuous unsupervised learning and automated root cause analysis and alert suppression to avoid alert storms.

The data forecasting feature, an AIOps tool, predicts future trends for the monitored infrastructure based on past performance. Anomalies and missing data are identified and removed, and then a capacity trending algorithm is applied to the sample to find the best fitting model for the collected data to make the prediction. Data forecasting visualization is available from any graph in the LogicMonitor interface.

Strengths: LogicMonitor scored well on all key criteria, especially for its ability to create and customize dashboards and the variety of its out-of-the-box dashboards. It is also able to provide forecasts of future trends.

Challenges: LogicMonitor Cloud is weak on network monitoring and bandwidth utilization. The ability to schedule reports is limited.

Logz.io

Logz.io is an Israel-based company with a large presence in the US. It primarily uses open source technologies and open standards (such as OpenTelemetry) to monitor, log, collect, search, and analyze observability data. Logz.io works well with agile cloud-native customers, most of whom are running Kubernetes in production. It supports a SaaS-based public cloud model.

The scalable Logz.io platform has four elements:

  • Elasticsearch, Logstash, Kibana (ELK)-based log management
  • Prometheus- and Grafana-based infrastructure monitoring
  • Jaeger-based distributed tracing
  • ELK-based cloud security information and event management (SIEM)

These are fully managed, integrated cloud services for effectively monitoring, troubleshooting, and securing distributed cloud workloads. While the logging solution has been around since 2014, the tracing and infrastructure components were added in 2020. The vendor also recently released a synthetic monitoring system using function as a service (FaaS).

Logz.io provides solutions for a number of use cases, including AWS and Azure observability and container monitoring, with the ability to monitor Docker and Kubernetes using a unified machine data analytics platform built on top of the ELK Stack and Prometheus.

Human-coached AI/ML examines online forums such as Stack Overflow and GitHub to determine which error and exception logs are most important to engineers. Real-time alerting based on logs, metrics, and traces is available, and alerts can be duplicated, tested, and sent to any endpoint.

Dashboards are available using Kibana, and existing Grafana dashboards can also be migrated to Logz.io. Users can also build, customize, and monitor data visualizations using PromQL with the ability to drill into interesting trends with filters and a drag-and-drop interface. OpenSearch Dashboards are another option for investigating large volumes of data. Filters, search phrases, and a date picker or relative time range selector are all available to help locate the required logs.

Support for microservices is also provided, with the ability to see how microservices are interacting with each other and supplying the information required to troubleshoot issues and improve performance. Users are able to view and drill down into end-to-end call sequences of selected requests of intercommunicating microservices.

Logz.io does less well with user interaction performance monitoring and predictive analysis, due to weak AI capabilities.

Strengths: Logz.io has good capabilities across some key criteria, including dashboards and reports for which the number of options for creating visualizations provides flexibility for users. The solution also has separate options for AWS, Azure, and GCP. GCP is a new addition since the last Radar.

Challenges: Logz.io is weak in some areas, particularly in its support for microservices detection, which is not as extensive as what other vendors offer. It also lacks capabilities in user interaction performance monitoring, and in predictive analysis, where it would benefit from stronger AI capabilities.

Microsoft

Microsoft is a well-established APM vendor, having acquired BlueStripe Software, a provider of application management technology, in 2015. In 2017, Microsoft merged Operations Management Suite (OMS) and Application Insights to form the basis of Microsoft Azure Monitor. BlueStripe’s solution was integrated into the new offering, adding application-aware infrastructure performance monitoring and enabling Azure Monitor to map, monitor, and troubleshoot distributed applications across heterogeneous operating systems and multiple data center and cloud environments.

Azure Monitor collects, analyzes, and acts on telemetry data from Azure and on-premises environments—including Azure Kubernetes Service (AKS), Azure Storage, databases, and Linux and Windows VMs—in a single map. Integration with other Microsoft applications, such as Power BI, extends the SaaS-only solution's capabilities.

Curated visualizations, reports, and diagnostic tools are available, such as Application Insights for application performance monitoring; VM, container, and network insights for infrastructure monitoring; and Log Analytics for deep diagnostics.

  • Application Insights enables users to detect and diagnose issues across applications and dependencies.
  • VM Insights and Container Insights monitor the performance of VM and container workloads, collecting metrics from controllers, nodes, and containers that are available in Kubernetes.
  • Log Analytics provides troubleshooting and deep diagnostics by allowing drill down into monitoring data. Smart alerts and automated actions are both supported.

Azure dashboards allow different types of data to be displayed in a single pane in Azure Portal. Dashboards can be customized and user-created. Azure Monitor Metrics allows data to be collected from monitored resources.

One-click access to features such as search and analytics is provided. Transactions are correlated end-to-end from apps and dependencies to infrastructure. Built-in topology maps for applications, VMs, and networks are included, as are OpenTelemetry-based vendor-agnostic tracing capabilities. A centralized log platform provides access to long-term storage for analysis.

Azure Monitor detects microservices and can collect and view metrics in AKS clusters and other dependent Azure services. Memory and processor metrics are collected from controllers, nodes, and containers via the Kubernetes metrics API. Application metrics are collected with Application Insights, which monitors an app and sends telemetry data to the Application Insights service. Telemetry data can also be brought in from the host environment and is then sent to Azure Monitor. Built-in correlation and dependency tracking is also available. A maximum events-per-second throughput exists for Application Insights, and if this limit is exceeded, the capture process is throttled.

Azure Monitor supports multicloud environments. It fully supports VMs and Kubernetes clusters on other clouds, private clouds, and on-premises servers. Application Insights supports instrumentation of applications regardless of the environment in which they run. However, Azure Monitor misses some vendor-specific functionality in multicloud, such as specific AWS and GCP audit logs and nonstandard platform as a service (PaaS) solutions.

Strengths: Microsoft has good capabilities across most key criteria. Azure Monitor allows data to be collected from a wide range of sources, providing users with the information needed to ensure that issues do not impact performance. Dashboards and reports allow users to customize and share the metrics they require.

Challenges: Microsoft has limitations in its support of multicloud environments, which limits its applicability. Although many organizations are Microsoft-centric and may have the majority of their applications running in Azure, they may still have some applications running in other clouds, and the limited support for multiple clouds would mean having to run multiple observability solutions, no doubt deterring some from using the Microsoft solution.

NetApp

Founded in 1992, NetApp is a hybrid cloud data services and data management company with over 30 years in the industry. It provides a number of solutions and cloud services to help companies manage their IT infrastructures.

NetApp Cloud Insights is a SaaS monitoring tool (available standalone or as an integrated service of the NetApp BlueXP control plane) that allows complex infrastructures to be mapped and provides real-time data visualization of the topology, availability, performance, and usage of the entire IT infrastructure, including cloud and on-premises environments. It provides an understanding of demand, latency, errors, and saturation points of all services. Automated discovery allows end-to-end service paths to be created. Root cause data is provided to identify performance level violations.

The NetApp approach is to handle the hardware up to the point of the application. Data can then be consumed and displayed from other sources. Predictive analytics, based on ML technology, provides alerts of potential issues before they escalate to become major problems. The tool operates across hybrid and multicloud environments, providing detailed and summary reporting that can be used for cost optimization and workload planning.

Other NetApp products, including Active IQ, integrate into the Cloud Insights UI to provide a single view of the environment, allowing potential trouble spots to be identified and cloud cost optimizations to be determined.

Cloud Insights includes an extensive library of standard and customizable dashboards and reports. Intelligent Insights provides preconfigured dashboards that predict and warn of potential problems while providing steps to resolve the issues. All dashboards and reports are customizable, can be created by users, and are added to the library for use by others with access. Reports can be scheduled or run ad hoc for local consumption or delivered to a configurable distribution list, which can be created for each report.

Predictive analysis is supported through Intelligent Insights, which can predict potential issues, such as misconfigurations, security vulnerabilities, storage time-to-full, and shared resources under load. Recommendations and/or automated solutions are provided to resolve these issues, and resolutions can thus be applied before problems impact customers. Thresholds can also be set for any metric or combination of metrics to generate alerts warning of potential problems.

In terms of inventory, Cloud Insights discovers on-premises and cloud-based compute, storage, and traffic infrastructure, including heterogeneous storage, VMs, and servers, as well as those with microservices and containers. Graphical representations show end-to-end data flow, allowing the speedy resolution of operational infrastructure issues. Real-time traffic statistics and performance metrics, stored in the data lake, can be used to ensure performance goals are met and measured against other compute and storage metrics.

Strengths: NetApp has outstanding capabilities in the area of dashboards and reporting, enabling users to view information in real time, both through prebuilt reports and customized or user-created reports. Predictive analysis is another very strong area, which allows users to anticipate and resolve issues before they impact users. NetApp’s ability to find ransomware threats is a standout among observability solutions.

Challenges: User interaction performance monitoring is not a current focus of the NetApp BlueXP and Observability Suite. This is a significant omission because it does not provide insights into how user transactions are being handled, which is an important element of observability. It is also weak in application performance monitoring.

New Relic

Founded in 2008, New Relic is headquartered in San Francisco, California. Its observability platform, New Relic, targets all enterprise verticals including technology, retail, finance, healthcare, media, industrials, and the public sector, focusing on forward-thinking organizations looking for innovative solutions to their problems.

New Relic is a cloud-based observability platform, packaged and priced as an all-in-one solution, meaning any user with a full license can access the full platform. The vendor uses consumption-based pricing. The default data retention period is 30 days for the standard data offering. Customers can extend retention by upgrading to Data Plus, which includes up to 90 days of extra retention for most data types over the default. Data is stored in New Relic's US data center by default. Customers can also choose to store data in the EU data center for an additional charge.

New Relic provides APM as well as infrastructure, browser, real user, synthetics, mobile, AIOps, and native client monitoring. It includes a unified telemetry solution that supports high-cardinality data from a range of sources including New Relic agents, open source solutions, and proprietary data sources spanning metrics, events, logs, and traces.

New Relic provides full-stack visibility from the client side (mobile, browser) to back-end services, to databases, infrastructure, and networks, with the ability to view traces and logs in context. Automap capabilities allow dependencies to be visualized. A real-time Java profiler enables troubleshooting cluster behavior to diagnose and improve performance by reducing bottlenecks.

An area of strength for New Relic is dashboards and reporting. Live dashboards are available that show telemetry in real time, or capture a specific moment in time. A time picker controls all data visualizations, enabling users to view live data or data within a specific timeframe. A variety of prebuilt reports are provided that include APM, infrastructure monitoring, browser, and Kubernetes. Dashboards can also be customized and users are able to share reports via a permalink or by downloading them as an image, PDF, or CSV file, if applicable. Default chart customizations are also available, such as date and time formats, axis customization, and interactive legends. Users can import, query, and create custom dashboards and reports on all their telemetry data.

New Relic provides browser- and real-user-monitoring capabilities by allowing the tracking of front-end application performance to show an application’s best and worst performing pages. Insights into the geographic locations of real users, the operating systems and types of devices they use, and how well they perform are also available. Different types of synthetic monitors are available for both internet- and intranet-based applications, including broken-links monitors, certificate-check monitors, ping monitors, step monitors, simple browser monitors, scripted browser monitors, and API tests.
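To make one of these monitor types concrete, the sketch below shows the essence of a broken-links monitor in plain Python. This is an illustrative stand-in, not New Relic code or API: the helper names and the example URL are assumptions, and a real synthetic monitor would run on a schedule from managed locations.

```python
# Minimal broken-links synthetic monitor sketch (illustrative only).
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def check_link(url: str, timeout: float = 5.0) -> tuple[str, bool]:
    """Probe one link; a non-2xx/3xx status or a network error marks it broken."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, 200 <= resp.status < 400
    except (HTTPError, URLError):
        return url, False


if __name__ == "__main__":
    # Hypothetical target page for the check.
    page = urlopen("https://example.com").read().decode()
    for url, ok in (check_link(u) for u in extract_links(page)):
        print(("OK     " if ok else "BROKEN ") + url)
```

A commercial solution adds what this sketch omits: distributed probe locations, alerting thresholds, and correlation of results with back-end telemetry.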

Ease of use and security are also outstanding areas for New Relic, which sports a new, consistent UI and improved integrations. The vulnerability manager is also a new feature, with the ability to connect to NIST (National Institute of Standards and Technology) when needed.

Strengths: New Relic has outstanding dashboards and reports capabilities that enable users to view data in real time to see how resources are performing. It also does well on all of the other key criteria, including user interaction performance monitoring, an area where New Relic has always been a leader.

Challenges: The default data retention period is 30 days with an extension of 90 days at an additional cost. This could be an issue for enterprises looking to understand long-term trends. In addition, while there is support for S3 data storage, it is not a transparent solution.

OpenText (Micro Focus)

Founded in 1976, Micro Focus is one of the longest-running players in the IT operations management (ITOM) monitoring space, known for solutions in DevOps, hybrid IT, security and risk management, and predictive analytics. Through a number of acquisitions (HP Software and Vertica among them), Micro Focus tried to expand into the IT observability space. Then, in January 2023, Micro Focus was itself acquired by OpenText. The acquisition gives OpenText, a vendor best associated with providing information management solutions, an entry into the observability and AIOps market.

OpenText (Micro Focus) supports public cloud (SaaS), private cloud (on-premises), and public and private cloud (hybrid) deployment scenarios.

The Operations Bridge product automatically monitors and analyzes the health and performance of multicloud resources across devices, operating systems, databases, applications, and services for all data types. The platform offers an event consolidation and correlation engine, and big data analytics-based noise reduction. It integrates end-to-end service awareness with rule- and ML-based event correlation capabilities delivered on top of an OPTIC data lake.

Operations Bridge automates the discovery of cloud resources and services across multiple infrastructures and clouds including AWS, Microsoft Azure (including Azure Stack), Google Cloud environments, and private clouds.

A SaaS-based AIOps platform consolidates data across toolsets to pinpoint service slowdowns and solutions, providing automated event and metric analyses. Integrated ML on events and data automatically provides problem identification with real-time automated event correlation and dynamic thresholds, along with interactive visual analytics.

Dashboard and reporting capabilities are provided using a single data store with real-time business value dashboards; alternatively, companies can use the BI tool of their choice. Business value dashboards (BVDs) display traditional status and KPI data from Operations Bridge and other IT sources. BVDs also share metrics with the OPTIC Data Lake Collect Once, Store Once (COSO) common data store.

Predictive analytics using ML creates dynamic baselines that automatically incorporate prior history and seasonality. Created events can alert operators to help identify issues when thresholds are broken before overall systems are impacted.
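A dynamic, seasonality-aware baseline of the kind described above can be sketched as follows. The windowing and threshold logic here are illustrative assumptions, not OpenText's algorithm: history is grouped by position within a repeating season, and a value that strays more than a few standard deviations from its seasonal mean raises an event.

```python
# Seasonality-aware dynamic baseline sketch (illustrative assumptions only).
from statistics import mean, stdev


def seasonal_baseline(history, period):
    """Group history by position within the season (e.g., hour of day for
    period=24) and return a (mean, stdev) pair per position."""
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(history):
        buckets[i % period].append(value)
    return [(mean(b), stdev(b) if len(b) > 1 else 0.0) for b in buckets]


def breaches(series, baseline, start_index, k=3.0):
    """Yield indices where a value falls outside mean +/- k*stdev for its
    seasonal position; these are the points that would alert operators."""
    period = len(baseline)
    for i, value in enumerate(series, start=start_index):
        mu, sigma = baseline[i % period]
        if sigma and abs(value - mu) > k * sigma:
            yield i


# Two "days" of a metric with a cycle of period 4, then a spike on day three.
history = [10, 50, 52, 12, 11, 49, 51, 13]
base = seasonal_baseline(history, period=4)
today = [10, 90, 50, 12]  # 90 breaks the baseline for its seasonal slot
print(list(breaches(today, base, start_index=8)))
```

A production system would use far richer models, but the principle is the same: thresholds track history and seasonality rather than a fixed line.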

Strengths: OpenText (Micro Focus) has good capabilities across most key criteria, particularly in its dashboard and reporting features, which allow users to use their BI tool of choice. Use of ML in predictive analysis helps to reduce issues that impact performance.

Challenges: The multicloud capabilities of OpenText (Micro Focus) could be stronger. Moreover, because the acquisition is so recent, many unknowns remain regarding the fate of the Micro Focus portfolio. OpenText will need to both communicate and demonstrate how it intends to manage Micro Focus products going forward.

SolarWinds

Founded in 1999, SolarWinds is headquartered in Texas, US. It develops software to help manage networks, systems, and infrastructure, and provides solutions to handle observability, IT service management, application performance, and database management. The observability product provides a full-stack solution that monitors on-premises and multicloud environments, with native support for AWS and Azure clouds, increasing visibility, intelligence, and productivity.

SolarWinds’ observability suite is offered via one unified platform that enables businesses to optimize their application and system performance, ensure availability, and reduce remediation time across on-premises and multicloud environments. It connects data from web applications and their services, cloud, and hybrid infrastructure, including Kubernetes, AWS, Azure, databases, networks, and from end-user experiences to deliver business insights and operational intelligence.

The SolarWinds Platform Connector is a private data fabric that enables interwoven full-stack observability across cloud-native, multicloud, hybrid, and on-premises environments. It employs integrated AIOps and automation capabilities to accelerate detection and resolution.

There are two versions of the platform: SolarWinds Observability is a SaaS solution, and SolarWinds Hybrid Cloud Observability is a self-hosted solution that can be deployed on-premises or in a public or private cloud, with a path to the cloud through integration with SolarWinds Observability.

SolarWinds has good capabilities across all key criteria. Its inventory capabilities enable the discovery of cloud services, including VMs, containers, and services. For Azure or AWS cloud accounts, agentless integration with the cloud provider is available. On-premises servers and network infrastructures are served by agents or collectors to discover and observe the relevant entities. Discovered services and network topologies are mapped and visualized.

Microservices are automatically discovered in instrumented environments, including public cloud accounts and Kubernetes clusters. The APM module shows service topology, its components, and dependencies for discovered services. However, its capabilities will be improved when the company adds deeper integration with CI/CD tools, which is on the roadmap.

The SolarWinds platform supports a number of clouds. Deep integration is provided with AWS and Azure. Private cloud environments, as well as resources in Google Cloud Platform (GCP), Oracle, and Heroku environments, can also be monitored using SolarWinds agents and/or collectors.

For predictive analysis, historical data, multidimensional baselining, and predictive analytics are used to provide recommendations for fixing current and predicted issues through automated actions.

Strengths: SolarWinds has good capabilities across all key criteria. Its inventory capabilities and microservices and cloud support help organizations to locate and log all cloud-based infrastructure across their entire environment, allowing all components to be monitored. This coverage helps to ensure high availability of systems and applications.

Challenges: SolarWinds does not currently support identification of shadow changes. However, it has plans to offer solutions in this space in the next 18 to 24 months under the security observability umbrella, bringing in technologies such as attack surface management, application security posture management, and runtime application self-protection.

Splunk

Splunk has been in the IT monitoring business for more than 15 years. In 2019, it acquired SignalFx (a SaaS-based monitoring and analytics platform) and Omnition (a SaaS-based tool for monitoring microservices applications), which enhanced the usability of the Splunk platform and transformed it into a full observability platform.

The Splunk solution combines monitoring, troubleshooting, and incident response capabilities that boost application and IT modernization initiatives. Splunk is a full-stack, multicloud, integrated enterprise solution comprising Splunk Observability Cloud and Splunk Enterprise. It brings together infrastructure monitoring, application performance monitoring, digital experience monitoring, real-user monitoring, synthetics, log investigation, AIOps, and incident response into a single platform for any hybrid cloud application environment.

Splunk Observability Cloud is SaaS-based and enables full-stack visibility across infrastructure, applications, and business services, providing insights to ITOps, DevOps, CloudOps, SREs, application developers, and service owners. It provides the monitoring services noted above, as well as log analysis, incident response, on-call management, and mobile alerting and dashboarding. Splunk Enterprise can be consumed through a SaaS model (Splunk partners with several cloud providers and does not maintain any private network) or through an on-premises deployment.

Preconfigured dashboard templates are provided for technical operators and practitioners, which showcase real-time infrastructure, application performance, and security metrics across the full stack. Preconfigured executive dashboards displaying metrics such as service availability status, number of contracts being processed, and how much revenue certain services are generating, are also provided to show business KPIs and business context. Dashboards and reports are fully configurable to visualize specific layouts and data. Customizable views such as Time Chart, Table, and Single View are available and users can create custom panels (reusable components) that can be added to dashboards and reports.

Splunk’s inventory capabilities include an OpenTelemetry collector that integrates with cloud services including Amazon Web Services, Microsoft Azure, GCP, containers, hosts, and Kubernetes environments and with service discovery that detects each application and service within the environment. The collector ingests metrics, events, and dimension properties and then automatically configures appropriate integration plug-ins and sends the data to Splunk Observability Cloud. Visibility into the health of containers, hosts, Kubernetes, and cloud-based services such as Kafka and RDS is also provided. Infrastructure metrics are connected to logs, and recommendations about which logs to investigate in poorly performing services are made.

Strengths: Splunk has good capabilities across all key criteria, which enable organizations to monitor and report on all areas of infrastructure across multiple clouds. This enables issues to be identified and potential problems to be predicted, ensuring that they are resolved before impacting performance.

Challenges: Splunk is a large and complex platform. Determining what to use and becoming proficient using the tooling can take time.

StackState

StackState was founded in 2015 and is headquartered in the Netherlands, although it has an office in Boston, Massachusetts. This nimble startup has built its observability data analytics solution from the ground up, offering some unique features that have found niche applicability with banking and finance, telecoms, and MSPs, mostly in Europe.

StackState is a topology-powered, relationship-based observability solution. A single product, it maps business services to their applications, infrastructure dependencies, configurations, and changes. The topology relationships are generally derived from production through the use of eBPF and continuously reflect the most accurate state of the landscape. An additional business layer can be added on top, with data drawn from ServiceNow or other IT management tools. StackState collects data by integrating with third-party monitoring tools, such as Splunk, and can be extended with the platform’s own agents. The SaaS offering ingests data directly from Kubernetes, AWS, and many other cloud integrations to give cloud-native teams the ability to build an understanding of their cloud-native applications. There is support for AWS Serverless Monitoring, through which OpenTelemetry collects traces and creates the topology.

StackState provides connected insights that deepen human understanding of the environment, based on exceptional collection and correlation of data. StackState shows the impact of issues on the business, identifies the cause of issues accurately, and encodes expert practices out of the box to quickly remediate issues. With StackState, any engineer can ensure smooth operation of applications and services, even if they lack specific knowledge of the application, service architecture, or the underlying infrastructure. SaaS and on-premises deployment options are available.

Out-of-the-box dashboards are automatically generated based on the components being investigated. Manual dashboards can also be created to deal with other metrics, such as tracking business metrics, and Grafana dashboards can be added on top using PromQL.

Multicloud resources can be viewed together in a single automated dashboard and individual resources can be viewed there as well.

Microservices, the processes supporting them, and the pods and containers running them are automatically detected, along with the traffic transiting among all services. This information is used to generate dynamic service maps that include links to the entire infrastructure supporting them. This smart mapping is achieved using the eBPF-based agent, with support for any programming language. Also included are links to CI/CD to make the changes that are triggered as a result of the issues identified.
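The value of such a traffic-derived topology is impact analysis: when one component fails, the map tells you which services sit upstream of it. The sketch below illustrates the idea in plain Python; the call edges are hardcoded and hypothetical (StackState derives them automatically with its eBPF-based agent), and the blast radius is just a breadth-first walk over the reversed dependency graph.

```python
# Topology-based impact analysis sketch (edges are hypothetical examples).
from collections import defaultdict, deque


def build_reverse_graph(calls):
    """calls: iterable of (caller, callee) pairs; returns callee -> set of callers."""
    callers = defaultdict(set)
    for src, dst in calls:
        callers[dst].add(src)
    return callers


def impacted_by(failing, callers):
    """BFS upstream from the failing component to every transitive caller."""
    seen, queue = set(), deque([failing])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


observed = [
    ("web", "checkout"), ("web", "search"),
    ("checkout", "payments"), ("payments", "postgres"),
    ("search", "elastic"),
]
# If postgres fails, payments, checkout, and web are all in the blast radius.
print(sorted(impacted_by("postgres", build_reverse_graph(observed))))
```

Layering business services on top of this graph, as StackState does, turns the same walk into an answer to "which business capabilities does this outage affect?"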

However, StackState lacks capabilities in the area of user interaction performance monitoring, where it supports business transactions but only has basic RUM and synthetic monitoring.

Strengths: StackState has good capabilities across most key criteria, enabling it to support most of the requirements of organizations in terms of observability. For example, its automatically generated dashboards allow users to see performance metrics in real time.

Challenges: StackState lacks full user interaction performance monitoring capabilities, providing a more limited view of user experience and therefore making it harder to act quickly when user experience is impacted.

Sumo Logic

Sumo Logic was built originally as a log-management, big data analytics, and SIEM solution, but the company has added tracing and metrics to revamp the product into a full observability platform. Sumo Logic Observability is a cloud-native, multitenant SaaS solution hosted on AWS and available in multiple regions, which is licensed as a single product.

Sumo Logic Observability provides cloud-native transactional intelligence for distributed business workflows by analyzing logs, metrics, and traces in real time with automatically generated application topology. Sumo Logic supports open standards for data collection such as Telegraf for metrics and OpenTelemetry for tracing, as well as others like Prometheus, FluentD, and FluentBit for other types of telemetry, and legacy open standards from the Cloud Native Computing Foundation (CNCF). The Sumo Logic Continuous Intelligence Platform ingests and analyzes data from applications, infrastructure, security, and IoT sources, then develops unified, real-time analytics. The platform employs AI/ML to create a smooth user experience for exploring logs, metrics, and traces.

Sumo Logic provides full support for AWS, Azure, and GCP environments, applications, and services, and integrates with cloud monitoring tools. Built-in pattern detection using ML, anomaly detection, outlier detection, and predictive analytics provides insights and a way to help locate root causes. There are native integrations for Kubernetes, Docker, AWS EKS, AWS Lambda, Azure AKS, and Azure functions.

User interaction performance monitoring capabilities include RUM, which leverages JavaScript. Full support for all modern web frameworks, including SPA applications, collecting Core Web Vitals KPIs and error logs from browsers, is provided by the Sumo Logic distribution of OpenTelemetry JS instrumentation. A scripted transaction against on-site and web-based applications can be executed by synthetic monitoring from the Sumo Logic Catchpoint integration. Test results are stored in the Sumo backend where they are available for analysis together with other APM KPIs on the same dashboards. Business transactions can be displayed in a number of ways: as a software service; a collection of software services organized in an application, connected by their dependencies; a funnel diagram explaining multiple-step transaction conversion rates; or an end-to-end trace (from device to the database) for individual executions of the transaction. Code-level tracing is available for any stack supported by OpenTelemetry.
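The funnel view mentioned above can be sketched simply: given how far each tracked transaction progressed through an ordered set of steps, compute step-to-step conversion rates. The step names and input format are hypothetical assumptions for illustration; Sumo Logic derives this from ingested traces rather than a literal list.

```python
# Funnel conversion-rate sketch (step names and inputs are hypothetical).
def funnel(steps, reached):
    """steps: ordered step names; reached: per-transaction index of the last
    step completed. Returns (step, count, conversion-from-previous-step) rows."""
    counts = [sum(1 for r in reached if r >= i) for i in range(len(steps))]
    rows = []
    for i, (name, count) in enumerate(zip(steps, counts)):
        prev = counts[i - 1] if i else count
        rows.append((name, count, count / prev if prev else 0.0))
    return rows


steps = ["view_cart", "enter_payment", "confirm_order"]
reached = [2, 2, 1, 0, 0, 1, 2]  # three of seven transactions completed all steps
for name, count, rate in funnel(steps, reached):
    print(f"{name:14} {count:3d}  {rate:.0%}")
```

The step with the sharpest drop in conversion is where the investigation starts, ideally by pivoting into the end-to-end traces of the transactions that abandoned there.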

A Sumo Logic app is a collection of prebuilt dashboards and searches that provide insights into the operation and/or security posture of third-party solutions. Sumo Logic’s app catalog has more than 100 such applications. In addition to prebuilt reports, users can customize and share their own reports.

Strengths: Sumo Logic has good capabilities across all key criteria, especially in user interaction performance monitoring. These capabilities help organizations to monitor the end-user experience through RUM, synthetic monitoring, and transaction tracing to assess how well user needs are being met and how well the application performs for them.

Challenges: Sumo Logic should extend its dashboard and reporting capabilities to allow scheduled reports to be distributed to specific users automatically. This would cut down on duplication of effort and also enhance productivity by ensuring that users have speedy access to the reports they require.

VMware

VMware Aria Operations for Applications (formerly branded Tanzu Observability by Wavefront and a part of the VMware Tanzu portfolio) supports cloud, hybrid cloud, and containerized applications. VMware is expanding its support for cloud and Kubernetes, and this platform is designed to help produce, maintain, and scale cloud-native applications. The solution can be deployed in a public cloud (SaaS), in private and public clouds, or in a private cloud.

Aria Operations for Applications is designed specifically to help enterprises with the monitoring, observability, and analytics of cloud-native applications and environments, including AWS, Azure, and GCP. It uses metrics, traces, histograms, span logs, and events. These are aggregated across distributed applications, application services, container services, and public, private, and hybrid cloud infrastructures to build a real-time picture of an entire ecosystem. More than 250 vendor integrations are provided.

The solution also provides observability into Kubernetes environments, auto-discovers Kubernetes workloads, and recognizes Kubernetes services. It populates out-of-the-box dashboards with metrics from all Kubernetes layers, including clusters, nodes, pods, containers, and system metrics.

Microservices-based applications can be monitored using built-in support for key health metrics, histograms, and distributed tracing and span logs for common languages and frameworks. All of these elements are unified in a single platform. Support is provided for OpenTelemetry-compliant solutions. Application maps display dynamic distributed application services in real time, with a drill-down capability allowing access to root causes. AI Genie uses ML-based anomaly detection and forecast prediction to provide visualization of incidents and future requirements across applications and infrastructure.

Many out-of-the-box dashboards are provided and users can customize dashboards or create their own. Drill-down is enabled and dashboards can be shared, but users need dashboard permission and modify access to save changes made to dashboards. Single- or multiple-chart dashboards are available and can be created in a number of ways including using a New Chart widget, following a wizard, or using the Templates widget and selecting an integration. Dashboards can be set to a specified time window or can be run in live mode.

Predictive analysis is supported with the ability to predict behavior from existing data with high seasonality, then generate an alert when a future problem is foreseen.

Strengths: VMware shows good dashboard and reporting capabilities, with a variety of ways to create dashboards. The solution is also able to detect microservices.

Challenges: Broadcom has announced its intent to purchase VMware. The lack of certainty in the future of VMware’s cloud observability offering, which competes with Broadcom’s, may make VMware a difficult choice for an observability solution.

6. Analysts’ Take

The cloud observability marketplace has seen many changes in the past few years. Initial offerings were modeled on application performance monitoring tools, with additional features adding multicloud, inventory management, and user performance monitoring. The early market leaders that have remained—Dynatrace, New Relic, and Cisco AppDynamics—trace their roots to APM, and they have been joined by several other vendors that have added functionality to compete. The majority of the solutions within the Leaders circle are platforms. Some—such as Datadog, SolarWinds, Splunk, Sumo Logic, Micro Focus, and Broadcom—have continued to add features and functionality, making their solutions even more compelling.

The current Radar shows that most players within the Leaders circle are capable of satisfying the needs of the majority of enterprises. The methods used to reach the same results may be different, and the marketing arms of each vendor will be swift to point out the distinctions between them. However, on the whole, all the Leaders can achieve similar results. Vendor selection will depend on the type of licensing available, professional services needed, the number of tools displaced, and the deployment and startup effort involved.

There are solutions with unique strengths that may show weaknesses in other areas. For example, NetApp provides ransomware detection based on file and user behavioral analytics while offering only a weak application performance suite. StackState provides an index-less multidimensional datastore with its own SQL-based query language that may provide true innovation as the product matures.

In general, the market has converged on the key criteria. The future will hold many surprises, and next year, the rating and rankings may be quite different as new functionalities are added. Specifically, security features, more robust and capable AI/ML functionalities, better user experiences, and a more seamless integration with DevOps are likely to differentiate these vendors in the future. The ability to find and notify on shadow (also known as “ghost”) changes will become important as the toolchains for application development and observability begin to merge.

Choosing a cloud observability solution becomes an exercise in finding what fits now without compromising the future. If total tool replacement is required, look carefully at the vendors in the Leaders circle. Their solutions provide great capabilities, though with a flexibility and complexity that can complicate their selection. Enterprises whose needs are more specialized may want to look at more innovative solutions, while shops with strong technical skills that are well versed in open source technologies should focus more on solutions positioned on the left half or Feature Play side of the Radar.

This year, the solutions show a degree of conformity. However, the next 12 to 18 months are likely to introduce new capabilities that will provide more differentiation than ever before.

7. Methodology

For more information about our research process for Key Criteria and Radar reports, please visit our Methodology.

8. About Ron Williams

Ron Williams is an astute technology leader with more than 30 years’ experience providing innovative solutions for high-growth organizations. He is a highly analytical and accomplished professional who has directed the design and implementation of solutions across diverse sectors. Ron has a proven history of excellence propelling organizational success by establishing and executing strategic initiatives that optimize performance. He has demonstrated expertise in planning and implementing solutions for enterprises and business applications, developing key architectural components, performing risk analysis, and leading all phases of projects from initialization to completion. He has been recognized for promoting effective governance and positive change that improved operational efficiency, revenues, and cost savings. As an elite communicator and design architect, Ron has transformed strategic ideas into reality through close coordination with engineering teams, stakeholders, and C-level executives.

Ron has worked for the US Department of Defense (Star Wars initiative), NASA, Mary Kay Cosmetics, Texas Instruments, Sprint, TopGolf, and American Airlines, and participated in international consulting in Qatar, Brazil, and the U.K. He has led remote software and infrastructure teams in India, China, and Ghana.

Ron is a pioneer in enterprise architecture who improved response and resolution of enterprise-wide problems by deploying “smart” tools and platforms. In his current role as an analyst, Ron provides innovative technology and strategy solutions in both enterprise and SMB settings. He is currently using his expertise to analyze the IT processes of the future with particular interest in how machine learning and artificial intelligence can improve IT operations.

9. About Sue Clarke

Sue Clarke has worked as an industry analyst for almost 25 years, supplying research, analysis, and advisory services in the content management space to both organizations and vendors. She has built up a wealth of knowledge and experience having spent more than 20 years focusing on enterprise content management (ECM) in areas including document management and collaboration, records management, enterprise file sync and share, search, content analytics, case management/business process management, capture and scanning, e-discovery, web content management, digital asset management, web analytics, and customer communications management.

10. About GigaOm

GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.

GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.

GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.

11. Copyright

© Knowingly, Inc. 2023. "GigaOm Radar for Cloud Observability" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.