As an enterprise discipline, data observability provides continuous, up-to-the-moment status updates about the overall health of an organization’s data. It reinforces data health so data is available, accurate, comprehensible, governable, and ultimately usable for the numerous data-centric applications, operational systems, and organizations that depend on it.
Data observability is critical for countering, if not eliminating, data downtime, in which the results of analytics or the performance of applications are compromised because of unhealthy, inaccurate data. Well-implemented solutions in this space are also useful for detecting and mitigating the effects of data drift, which occurs, for example, when data powering predictive models in production begins varying in shape or statistical profile from that in training settings, skewing data science efforts. Downstream impacts of these and other effects of poor data health include inordinate delays, difficulty complying with regulations, churn, and increased litigation and penalties.
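To make the idea of data drift concrete, the following is a minimal sketch of one common way to quantify a shift in statistical profile between training and production data: the Population Stability Index (PSI). The function name `psi`, the bin count, and the 0.2 alert level are illustrative assumptions (0.2 is a frequently cited rule of thumb, not a standard), and the sketch does not represent any vendor's method.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a newer sample."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor keeps the logarithm defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: training data versus shifted production data.
training = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]
drifted = psi(training, production) > 0.2  # assumed alert level
```

A score near zero means the two samples fall into the baseline's bins in similar proportions; larger values indicate that production data has moved away from the training profile.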
As their names suggest, there are obvious similarities between data observability and classic observability, the latter of which is closer to application performance monitoring. However, data observability is almost solely focused on the state of data itself, as opposed to the logs, traces, and metrics that are central to observability for applications and systems.
That said, it’s worth noting that many developments fueling innovation in data observability seem to borrow constructs from the classic form while tailoring them to a data-driven focus. There is a wealth of AI techniques that can infer the state of data, monitor it, deliver alerts about issues, and remediate problems. One benefit of data observability vendors is that they apply these and other measures to data whether it’s at rest, in motion, behind enterprise firewalls, or anywhere in hybrid and multicloud environments.
Of additional value is the fact that data observability applies to data at a granular level. This discipline supports inferencing, monitoring, alerting, and corrective capabilities for schema particulars and the data quality dimensions of completeness, uniqueness, timeliness, validity, and others. These mechanisms are equally suited to data in pipelines or repositories, and they rely on cutting-edge dashboards and visualizations that show the state of data in ways that are understandable even to nontechnical users. This area also utilizes both predictive capabilities and detailed root cause analysis via data lineage constructs that reveal where breakdowns occurred and how to prevent them from recurring.
This GigaOm Radar report highlights key data observability vendors and equips IT decision-makers with the information needed to select the best fit for their business and use case requirements. In the corresponding GigaOm report “Key Criteria for Evaluating Data Observability Solutions,” we describe in more detail the capabilities and metrics that are used to evaluate vendors in this market.
All solutions included in this Radar report meet the following table stakes—capabilities widely adopted and well implemented in the sector:
- Data profiling
- Dashboards and data visualizations
- Data lineage
- Continuous monitoring and alerting
How to Read this Report
This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:
Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.
GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.
2. Market Categories and User Segments
To better understand the market and vendor positioning (Table 1), we assess how well data observability solutions are positioned to serve specific market categories and user segments.
For this report, we recognize the following market segments:
- Small-to-medium business (SMB): In this category, we assess solutions on their ability to meet the needs of organizations ranging from small businesses to medium-sized companies. Also assessed are departmental use cases in large enterprises where ease of use and deployment are more important than extensive management functionality, data mobility, and feature set.
- Large enterprise: Here, offerings are assessed on their ability to support large and business-critical projects. Optimal solutions in this category will have a strong focus on flexibility, performance, data services, and features to improve security and data protection. Scalability is another big differentiator, as is the ability to deploy the same service in different environments.
- Specialized: Optimal solutions will be designed for specific workloads and use cases, such as big data analytics and high-performance computing (HPC).
In addition, we recognize three user segments for solutions in this report:
- Data engineers: Data engineers are the most technologically advanced user subset of data observability platforms. This group is often responsible for the various transformations, integration processes, and data preparation necessary to pipeline data from sources to targets for analytics and applications. As such, data engineers have a marked interest in the pipeline functionality of these tools, and they rely on these tools to ensure data reliability, validation, and testing for data quality.
- Data stewards: Data stewards are responsible for carrying out data usage and security policies related to data governance, data quality, data reliability, and ongoing data health as determined through enterprise data governance initiatives. They act as liaisons between the IT departments and business side of an organization. Monitoring and alerting features of data observability solutions are critical to data stewards, who are also familiar enough with line-of-business conventions to specify data quality standards, understand how data is used by the business, and help define some of the terminology to which data refers.
- Business users: Oftentimes, the low-code visual approaches of data observability platforms, coupled with some of their more self-service characteristics, appeal to business users. They may be notified when there are incidents and have an interest in assessing the upstream and downstream impact for their workflows. Additionally, these users are influential in helping define conventions for data quality, business glossaries, and data validation.
Table 1. Vendor Positioning: Market Segment and User Segment
| SMB | Large Enterprise | Specialized | Data Engineers | Data Stewards | Business Users |
Legend:
- Exceptional: Outstanding focus and execution
- Capable: Good but with room for improvement
- Limited: Lacking in execution and use cases
- Not applicable or absent
3. Key Criteria Comparison
Building on the findings from the GigaOm report “Key Criteria for Evaluating Data Observability Solutions,” Table 2 summarizes how each vendor included in this research performs in the areas we consider differentiating and critical in this sector. Table 3 follows this summary with insight into each product’s evaluation metrics—the top-line characteristics that define the impact each will have on the organization.
The objective is to give the reader a snapshot of the technical capabilities of available solutions, define the perimeter of the market landscape, and gauge the potential impact on the business.
Table 2. Key Criteria Comparison
| Schema Change Monitoring | Data Pipeline Support | AIOps | Advanced Data Quality | Edge Capabilities |
Legend:
- Exceptional: Outstanding focus and execution
- Capable: Good but with room for improvement
- Limited: Lacking in execution and use cases
- Not applicable or absent
Table 3. Evaluation Metrics Comparison
| Contextualization | Ease of Connectability or Configurability | Security & Compliance | BI-Like Experience | Reusability |
Legend:
- Exceptional: Outstanding focus and execution
- Capable: Good but with room for improvement
- Limited: Lacking in execution and use cases
- Not applicable or absent
By combining the information provided in the tables above, the reader can develop a clear understanding of the technical solutions available in the market.
4. GigaOm Radar
This report synthesizes the analysis of key criteria and their impact on evaluation metrics to inform the GigaOm Radar graphic in Figure 1. The resulting chart is a forward-looking perspective on all the vendors in this report based on their products’ technical capabilities and feature sets.
The GigaOm Radar plots vendor solutions across a series of concentric rings, with those set closer to the center judged to be of higher overall value. The chart characterizes each vendor on two axes—balancing Maturity versus Innovation and Feature Play versus Platform Play—while providing an arrow that projects each solution’s evolution over the coming 12 to 18 months.
Figure 1. GigaOm Radar for Data Observability
As you can see in Figure 1, the majority of the vendors are in the Innovation/Platform Play quadrant. This attests to the comprehensive platform approach taken by many of the innovators in this space, all of whom are pure-play vendors with dedicated offerings for data observability and related fields like data reliability and data testing.
It’s also notable that all the Outperformers and Fast Movers are grouped in this quadrant, which demonstrates the effectiveness of prioritizing innovation and a holistic platform. The Leaders that are closest to the center of the Radar have adopted this approach, which generally includes expansive data pipeline support like circuit breaker and binning capabilities that permit only healthy data to reach its targets.
Breaking from the grouping in the bottom right, two Leaders are in the Maturity/Feature Play quadrant. This denotes the fact that data observability is no longer a niche aspect of the data landscape; it is rapidly being incorporated into broader data management solutions. These vendors are pairing data observability modules with ones for data catalogs, data governance, data access security, and more. This development could very well presage the future of data observability as it matures and becomes a mainstay for data management itself.
Inside the GigaOm Radar
The GigaOm Radar weighs each vendor’s execution, roadmap, and ability to innovate to plot solutions along two axes, each set as opposing pairs. On the Y axis, Maturity recognizes solution stability, strength of ecosystem, and a conservative stance, while Innovation highlights technical innovation and a more aggressive approach. On the X axis, Feature Play connotes a narrow focus on niche or cutting-edge functionality, while Platform Play displays a broader platform focus and commitment to a comprehensive feature set.
The closer to center a solution sits, the better its execution and value, with top performers occupying the inner Leaders circle. The centermost circle is almost always empty, reserved for highly mature and consolidated markets that lack space for further innovation.
The GigaOm Radar offers a forward-looking assessment, plotting the current and projected position of each solution over a 12- to 18-month window. Arrows indicate travel based on strategy and pace of innovation, with vendors designated as Forward Movers, Fast Movers, or Outperformers based on their rate of progression.
Note that the Radar excludes vendor market share as a metric. The focus is on forward-looking analysis that emphasizes the value of innovation and differentiation over incumbent market position.
5. Vendor Insights
The Acceldata Data Observability Platform has four core facets: a data reliability layer that covers data quality, data reconciliation, and drift (for both schema and data); pipeline support for monitoring events and transformations; a compute layer dedicated to cost and performance; and a user layer that delivers insights to team members in varying roles.
The platform has a dashboard, which can filter according to data segments, specific pipelines, and sources, illustrating these policies during the monitoring process. Scores are based on user-defined thresholds, weights, and the point in time that these factors are applied. Users can subscribe to particular policies, get at-a-glance info about whether the data is trustworthy, and drill down as needed.
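As a rough illustration of how a score can be derived from user-defined thresholds and weights, the sketch below computes a weighted average of per-policy pass rates and lists the policies that fall below their thresholds. The `PolicyResult` structure, field names, and formula are assumptions made for illustration; the report does not describe how Acceldata actually calculates its scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PolicyResult:
    name: str
    pass_rate: float  # fraction of records passing the policy, 0.0 to 1.0
    threshold: float  # user-defined minimum acceptable pass rate
    weight: float     # user-defined relative importance


def reliability_score(results: List[PolicyResult]) -> float:
    """Weighted average of pass rates across all policies."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one policy needs a positive weight")
    return sum(r.weight * r.pass_rate for r in results) / total_weight


def failing_policies(results: List[PolicyResult]) -> List[str]:
    """Names of policies whose pass rate is below the user-defined threshold."""
    return [r.name for r in results if r.pass_rate < r.threshold]


# Hypothetical results for two policies on one dataset.
results = [
    PolicyResult("null_check", pass_rate=1.0, threshold=0.99, weight=3.0),
    PolicyResult("freshness", pass_rate=0.5, threshold=0.9, weight=1.0),
]
score = reliability_score(results)
```

Keeping the overall score separate from the list of failing policies mirrors the at-a-glance view plus drill-down pattern described above.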
Acceldata has a software development kit (SDK) for monitoring pipeline reliability. It tracks telemetry data for pipeline jobs; runs and executions; dependencies and lineage between jobs and data; and metadata like data volumes, record counts, and custom metrics. The system provides visualizations of desired data pipelines in one pane of glass and aggregates relevant data reliability checks to underpin root cause analysis when issues are detected. The pipeline monitoring mechanisms include performance metrics and the ability to determine where in pipelines data health issues arise. When necessary, organizations can implement “circuit breakers” to stop pipelines from running, quarantine bad data, and permit the reliable data to continue.
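The circuit breaker pattern mentioned above can be sketched generically as follows. This is a minimal illustration under stated assumptions, not Acceldata's SDK: the function `run_with_circuit_breaker`, the `CircuitOpen` exception, and the `max_bad_ratio` parameter are all hypothetical names. Records that fail validation are quarantined, healthy records continue, and the pipeline halts if too large a share of a batch is bad.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


class CircuitOpen(Exception):
    """Raised when too many records in a batch fail validation."""


def run_with_circuit_breaker(
    batch: Iterable[Record],
    is_healthy: Callable[[Record], bool],
    max_bad_ratio: float = 0.2,  # assumed tolerance before the pipeline stops
) -> Tuple[List[Record], List[Record]]:
    """Split a batch into healthy and quarantined records, or stop the pipeline."""
    good: List[Record] = []
    quarantined: List[Record] = []
    for record in batch:
        (good if is_healthy(record) else quarantined).append(record)
    total = len(good) + len(quarantined)
    if total and len(quarantined) / total > max_bad_ratio:
        raise CircuitOpen(f"{len(quarantined)}/{total} records failed validation")
    return good, quarantined


# Hypothetical batch: nine valid records and one with a missing amount.
batch = [{"id": i, "amount": 10.0} for i in range(9)] + [{"id": 9, "amount": None}]
good, quarantined = run_with_circuit_breaker(batch, lambda r: r["amount"] is not None)
```

In this example the single bad record is quarantined and the other nine proceed; a batch in which more than 20% of records fail would raise `CircuitOpen` and stop the run.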
In addition to bulk policies that are applied after automatically scanning data, Acceldata supports writing policies with a rules-based approach. The system has an out-of-the-box rules library so users can expedite rules authoring in a point-and-click manner. There are also features around spend intelligence to ascertain where costs are coming from, with respect to cloud services, compute and pipeline consumption, and more. Costs can be attributed to user-defined business units or individuals and specific queries.
Strengths: Acceldata’s automatically generated bulk policy capabilities, paired with its low-code means of writing policies based on business rules, and circuit breaker support for data pipelines will help almost any organization.
Challenges: Many of Acceldata’s top competitors leverage machine learning (ML) to analyze data history shortly after connecting to sources and then use that analysis as a baseline for determining anomalies right out of the box. Adding such capabilities to its platform would make Acceldata more competitive.
Anomalo was founded in 2018 to help data teams automatically detect data issues, identify their root cause, and resolve them before other business units notice. The platform utilizes a combination of unsupervised data monitoring, validation rules, and metric anomalies for detection. It surfaces alerts with explanations and rich visualizations that users can drill down into for root cause analysis. Anomalo also has resolution capabilities to understand, triage, and rectify issues before organizational objectives are compromised.
Anomalo provides observability on data as it’s pipelined into cloud warehouses, data lakes, and data lakehouses. It also facilitates observability on data transformations, synthesis processes, and changes within the above repositories resulting in “new” data. This solution connects to the data in tables of source systems and monitors them in those environments. It supports structured and semi-structured data within a low-code framework.
ML is instrumental to much of Anomalo’s detection capabilities. Self-supervised learning techniques can supply unsupervised data monitoring in which models assess the historical trends for a specific table or dataset and detect significant points of variation. Alternatively, users can define either business metrics or metrics about data quality via a no-code UI, and the system will detect anomalies outside of ranges based on historical patterns. Anomalo also supports traditional validation rules.
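Detecting a metric value that falls outside a range based on historical patterns can be illustrated with a simple statistical stand-in. The sketch below uses a z-score against past values rather than machine learning, so it is a deliberately simplified substitute for the techniques described above and not Anomalo's method; the function name and the threshold of three standard deviations are assumptions.

```python
import statistics
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest metric value if it is far from the historical mean."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation; needs two or more points
    if stdev == 0:
        # A perfectly constant history makes any change notable.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


# Hypothetical daily row counts for one table.
row_counts = [100, 102, 98, 101, 99, 100, 103, 97]
normal_day = is_anomalous(row_counts, 101)
bad_day = is_anomalous(row_counts, 140)
```

Production systems typically account for trend and seasonality as well, which a fixed mean and standard deviation cannot capture.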
Alerts are issued when anomalies are detected using these three methods. Alerts include verbal summaries of anomalies alongside automated visualizations depicting them down to the columnar level for triaging. Similarly, the automated root cause analysis capabilities foundational to the system’s resolution mechanisms provide visualizations of historical patterns related to issues and samples; they’re a click away from the alerts. Anomalo’s ML supplies those visualizations and derives root causes from analyzing the data. Other resolution features include a UI for triaging alerts and data issues and ticket filing in tools like ServiceNow and Jira to orchestrate resolutions.
Strengths: Anomalo’s ML plays a leading role in its detection and root cause analysis for superior automation. Its visual approach to surfacing alerts, exploring their history, and implementing root cause analysis is strong.
Challenges: Improving Anomalo’s capabilities to include inter-table data-diffs could make this solution stronger, as could more advanced support for providing data observability in data pipelines.
Unlike many data quality solutions that have rebranded themselves as data observability platforms, Bigeye was founded after the term emerged. Since then, it has categorized itself as a data observability solution. Its data-engineering-focused platform enables users to connect to data sources and automatically monitor for data health issues via the company’s ML algorithms. Organizations can also apply business rules as the basis for monitoring, alerting, and resolving data observability issues.
Bigeye connects to a gamut of sources, including transactional databases, data warehouses, data lakes, BI tools, and data catalogs. The company expects to add connectors to ETL platforms later this year. The platform employs data profiling and ML to learn historical patterns about data values as a baseline, builds a lineage map of data’s journey from sources to targets, then monitors each stop along the way. The system suggests over 70 data quality metrics for tracking tables across aspects of numeric distribution shifts, duplicate and null rates, and more. Customized monitoring recommendations per dataset entail validity, pipeline reliability, completeness, distributions, and uniqueness. Bigeye’s ML does time series forecasting for predicting anomalies in data values before alerts are generated.
Organizations can also write customized business rules via a template feature to check specific parameters (such as accuracy) before automating them and their alerts. There’s a grouped metric feature enabling these checks to be employed for analysis across any dimension. The templated approach favors reuse of rules via the UI before they eventually become part of ML model training.
Bigeye’s pipeline support includes monitoring aspects of volume, schema changes, and data freshness as data gets to its destinations. Via its Bigconfig solution, engineers can implement monitoring as code, which describes monitoring infrastructure, what to alert for, and when to do so as automatable, reusable code, or they can access Bigeye’s application programming interface (API) for developer tools to assist with monitoring. Bigconfig enables engineers to define as code what Bigeye monitors, so users can dynamically monitor multiple jobs with a few lines of code.
Alerts include visualizations and numerical values showing data’s current anomalous and historical state for easy assessments. They’re issued in channels like Microsoft Teams, Slack, and PagerDuty, as well as inside BI or analytics tooling. The lineage map provides root cause analysis, triaging, and impact analysis. Other resolution characteristics include built-in ticketing and alerting. In addition, Bigeye recently acquired Data Advantage Group, considerably extending its own data lineage and root cause analysis so that it can be automatically mapped throughout data lakes, data warehouses, transactional systems, BI platforms, and ETL tooling.
Strengths: Bigeye’s comprehensive pipeline analysis and integrations with a variety of sources and systems are impressive. It also delivers a sensible blend of low-code and coding functionality to suit the use case at hand.
Challenges: The platform could be enhanced with native functionality for quarantining bad data.
Datafold approaches data observability from a testing and lineage perspective. Its open source data-diff product automates regression testing for pipelining data from sources to targets. The company also has a column-level lineage offering that illustrates dependencies and impact analysis via plug-and-play capabilities. Common use cases include accelerating dbt model creation by enabling engineers to test as they code, automated testing for continuous integration (CI) to preclude deploying bad dbt results, and validating data after data migrations. Datafold also has a catalog to profile source data and consolidate metadata about selected data assets.
With data-diff, organizations can automatically see and quantify differences between any two tables, including schema, rows, and actual values. The solution integrates into CI workflows via GitLab and GitHub and supports CI/CD processes. Its real-time analysis of differences lets users see the impact of code changes across those tables to readily determine their effects.
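The kind of comparison described above (schema, rows, and values) can be sketched in a few lines for small in-memory tables. This is an illustrative toy, not Datafold's data-diff implementation or API: the function `diff_tables`, the row representation, and the output keys are all assumptions.

```python
from typing import Dict, List

Row = Dict[str, object]


def diff_tables(before: List[Row], after: List[Row], key: str) -> dict:
    """Compare two tables by primary key: schema, row, and value differences."""
    b = {row[key]: row for row in before}
    a = {row[key]: row for row in after}
    before_cols = set().union(*(r.keys() for r in before))
    after_cols = set().union(*(r.keys() for r in after))
    shared_cols = before_cols & after_cols

    changed = {}
    for k in b.keys() & a.keys():
        cols = sorted(c for c in shared_cols if b[k].get(c) != a[k].get(c))
        if cols:
            changed[k] = cols

    return {
        "columns_added": sorted(after_cols - before_cols),
        "columns_removed": sorted(before_cols - after_cols),
        "rows_added": sorted(a.keys() - b.keys()),
        "rows_removed": sorted(b.keys() - a.keys()),
        "values_changed": changed,  # primary key -> columns whose values differ
    }


# Hypothetical production and development versions of the same table.
prod = [{"id": 1, "status": "new"}, {"id": 2, "status": "paid"}]
dev = [{"id": 2, "status": "refunded", "region": "EU"},
       {"id": 3, "status": "new", "region": "US"}]
report = diff_tables(prod, dev, key="id")
```

Real tools perform this comparison inside or across databases without loading full tables into memory; the sketch only shows the shape of the result.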
This approach is suitable for ETL since it reveals how source code changes alter the data traversing data pipelines. Users can access a data-diff report card that provides at-a-glance comparisons of production and development environments. These impact analysis reports are shareable among users.
SQL code reviews become easier with data-diff, which can reveal how regular expressions impact data models and pipelines and whether the correct business logic is being used in “CASE WHEN” statements. Data-diff also lets organizations specify audit metrics and audit changes to specific rows and values to further validate business logic.
Datafold’s column-level lineage product connects to data warehouses, analyzes existing SQL statements, then generates a metadata graph. It facilitates upstream root cause analysis and downstream impact analysis while showing how data is produced, consumed, and transformed. It effectively gives users a comprehensive overview of their data lineage, enables them to drill down into specific tables, trace flows at the columnar level, and view SQL statements for each step in the pipeline.
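Once lineage is represented as a graph, upstream root cause analysis and downstream impact analysis both reduce to graph traversal. The sketch below assumes a hand-written dictionary of column-level edges; the column names and helper functions are hypothetical, and the step of deriving edges by parsing SQL is not shown.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical column-level edges: each column maps to the columns derived from it.
LINEAGE: Dict[str, List[str]] = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "raw.orders.currency": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.revenue.daily_total"],
    "marts.revenue.daily_total": [],
}


def _reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search returning every node reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(column: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Impact analysis: everything that depends on this column."""
    return _reachable(column, edges)


def upstream(column: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Root cause analysis: everything this column is derived from."""
    reversed_edges: Dict[str, List[str]] = {}
    for src, targets in edges.items():
        for target in targets:
            reversed_edges.setdefault(target, []).append(src)
    return _reachable(column, reversed_edges)


impacted = downstream("raw.orders.currency", LINEAGE)
suspects = upstream("marts.revenue.daily_total", LINEAGE)
```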
These mechanisms make it easy to track desired data types, such as personally identifiable information. A GraphQL Metadata API can be used to query the metadata in other systems and import data lineage from external systems; currently, the only supported data catalogs are DataHub and Amundsen. These capabilities let users prioritize data migrations based on dependencies and pipeline usage while denoting stale data that’s no longer regularly used.
Strengths: The automation and reliability of Datafold’s testing for data pipelines make this critical job easy. Its data lineage graph offers solid impact analysis and sheds light on root cause analysis.
Challenges: Datafold is less applicable for unstructured data. It also lacks some of the more sophisticated functions of competing data observability solutions, like circuit breakers and the ability to automatically generate thresholds for data quality metrics and rules based on historical analysis of data.
Decube Data’s platform combines data observability with a data catalog and data governance solution. The offering uses a fair amount of automation to expedite and perform many key tasks for data observability.
When users connect sources to Decube, the platform automatically detects thresholds for tables based on metadata analysis. Aspects of this metadata are stored in the data catalog, which users can access to specify tables they want monitored. Schema drift monitoring is automatic, while other options include frequency of updates, freshness, and volume. Users can also select if they want notifications issued for these metrics.
Field-level tests vary by data type and include null checks, cardinality, uniqueness, string length, and regex matching, among others. Organizations can also write SQL scripts to write tests for their own needs based on business logic.
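The field-level tests listed above are straightforward to express in code. The following sketch runs them against a single string column; it is a generic illustration rather than Decube's implementation, and the function name, parameters, and sample email pattern are assumptions.

```python
import re
from typing import Dict, List, Optional


def field_report(values: List[Optional[str]], pattern: str, max_length: int) -> Dict[str, object]:
    """Run basic field-level checks on one string column."""
    present = [v for v in values if v is not None]
    regex = re.compile(pattern)
    return {
        "null_count": len(values) - len(present),
        "cardinality": len(set(present)),                # number of distinct values
        "is_unique": len(set(present)) == len(present),  # no duplicates among non-nulls
        "too_long": [v for v in present if len(v) > max_length],
        "regex_mismatches": [v for v in present if not regex.fullmatch(v)],
    }


# Hypothetical email column with a null, a duplicate, and a malformed value.
emails = ["a@x.io", "b@y.org", None, "a@x.io", "not-an-email"]
report = field_report(emails, pattern=r"[^@\s]+@[^@\s]+\.[a-z]+", max_length=10)
```

Checks tied to business logic, such as cross-table reconciliation, generally need custom queries instead, which is where the SQL scripts mentioned above come in.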
Decube Data has a data dashboard that provides an at-a-glance view of factors impacting data quality, which includes total incidents and incidents involving schema changes, volume, freshness, and field health. Organizations can choose which sources they want to see this information for and denote specific time frames. The platform also provides a health score for the overall data.
Additional features let users monitor pipelines and dbt models with a low-code configuration approach. Organizations can select alerts and where to send them in addition to specifying responsibilities for each job or dbt model. The platform also supports data-diffing between datasets to facilitate data reconciliation. Alerts are grouped to reduce alert fatigue and can be delivered through Slack, webhook, and email. Decube also provides comprehensive data lineage to help discern root cause analysis as well as specific pipelines, data sources, and jobs affected by incidents.
Strengths: The metadata management supported by Decube’s data catalog and the metadata harvesting capabilities it offers help organizations quickly begin their data observability endeavors.
Challenges: Greater out-of-the-box automation for aspects of data quality and data reliability such as column completeness and column sums could make this solution better, as could more extensive documentation on the company’s website, specifically about its data pipeline support.
Experian Aperture Data Studio approaches data observability through a data quality lens. It connects to numerous big data, SQL, and NoSQL sources and has an SDK and API to link to other platforms. The solution is largely codeless, with processes visually defined and manipulated through its UI.
During data discovery, the system profiles the data automatically (with over 70 data profiling analyses) and tags data assets via ML. There’s a function library with reusable data quality checks and a user interface for nontechnical users to devise their own data profiling rules. These mechanisms become the basis for detecting changes to data profiles and alerting users. The data discovery process is enriched by reference data from Experian datasets from over 240 countries, which can aid domain-specific validation efforts.
Experian Aperture Data Studio comes with out-of-the-box matching rules for disambiguating entities. Supervised learning tunes those rules to organization-specific use cases. The aforementioned reference data is helpful for further confirming matches in cases where confidence isn’t high. Organizations can implement business rules for the fields relevant to specific data types or entities too. When new sources are added, the system can suggest whether or not to apply those rules based on the ML data discovery and tagging.
Data pipeline capabilities involve diagramming features for users to define and visualize workflows as well as view and analyze data at each stage. The platform provides ongoing monitoring of data with deep learning techniques on a field-by-field basis to pinpoint patterns requiring alerts. This anomaly detection involves predictive time series analysis to analyze patterns over time. Additional functionality includes analyzing individuals’ use of the platform and disseminating alerts to them based on that behavior. There are also tools for reporting and determining the business impact of issues. An analytics engine enables users to perform various aggregations, joins, and filters.
Strengths: Experian Aperture Data Studio’s automatic data profiling capabilities are well-adapted for enabling users to understand what data they have and how to monitor it accordingly. The solution’s expansive use of ML provides valuable automation for data discovery and monitoring, and the reference data enrichment is a key differentiator.
Challenges: The support for classic aspects of data observability, such as root cause analysis, isn’t as strong as that of competitors.
Databand, acquired by IBM in 2022, furnishes continuous data observability to improve engineering, quality, and reliability of data. It’s designed to “shift left” the detection of data incidents for swift resolution prior to their impacting business processes.
Databand automates metadata collection from sources and data pipeline tooling (such as dbt, Spark, Airflow, Databricks, and Snowflake), then profiles pipeline behavior to construct a baseline for how pipelines function. It also profiles the data to help develop SLAs. Databand employs statistical AI approaches to detect and provision alerts for relevant users and relies on comprehensive data lineage for both impact and root cause analysis.
Data quality capabilities entail monitoring SLAs, detecting null records before they reach data warehouses, and surfacing outliers like unusual column changes. Users can also access temporal snapshots to glean the effects of data that fails checks.
Databand lets organizations connect to and visualize data pipelines and processes to identify missing data operations, time-consuming run durations, and failed jobs. This allows users to see trends in real time and detect anomalies based on metadata. The comprehensive platform approach also unifies error logs to determine where and why pipelines failed. It provides historical trend analysis of impacted datasets too. The solution utilizes its own open source library with resources for automating DataOps processes, tracking pipeline health, and monitoring data quality.
There’s a single pane of glass for data quality and data reliability incidents accompanied by instant alerts on pipeline performance, data quality metrics, and SLAs. Users can define, customize, catalog, receive, and profile alerts in one place, and can also route them through channels like email, PagerDuty, and Slack.
Root cause analysis is supplied by cross-tool data lineage and error logs. Visualizing this information is critical for viewing upstream and downstream details, understanding which areas have been or will be affected, and triaging incidents. Users can also create workflows for resolving data reliability and data pipeline issues and for reinforcing data integrity.
Strengths: The visual nature of IBM Databand, along with its low-code capabilities for visualizing trends and issuing alerts about anomalies, makes it appealing to a broad population of users.
Challenges: Some of the solution’s pipeline functionality around binning and circuit breakers isn’t as sophisticated as that of its top competitors.
Informatica’s data observability stack is embedded within its broader ecosystem of product offerings and accessed through its Intelligent Data Management Cloud. Its ML relies on the vendor’s AI engine, CLAIRE. The metadata management takes advantage of its data catalog, enhancing FinOps and data governance capabilities. This results in comprehensive capabilities for establishing data lineage, data quality, impact analysis, root cause analysis, and anomaly detection to minimize data drift and optimize data delivery.
The platform connects to the stack’s subsystems to profile data and generate statistics about it, which informs data quality values, data quality scores, and data quality scorecards viewable through the catalog and governance modules. Alerts are issued when scores decline, and the system monitors data even when the scorecards aren’t accessed.
Alerts and data visualizations are exposed through a customizable dashboard widget that supports filtering based on custom queries. Alerts are accessible as objects in the catalog and governance modules as well as through email and common collaboration/messaging platforms such as Slack and Teams. Options designed to prevent alert fatigue, such as customizing or aggregating alerts by category, are also available.
Pipeline features entail data traffic analysis buttressed by a statistical analysis of patterns related to data movement. Largely backed by CLAIRE, the system can shut down pipelines when anomalies are detected, deliver provisioning suggestions for under- and over-provisioned resources, and auto-scale to optimize pipelines. Users get visibility into every phase of a pipeline, while technical and business data lineage facilitates impact analysis, root cause analysis, and visual inspections of problem areas. Once incidents are detected, an incident management workflow is triggered. Users can store observations of incidents over time and customize them according to retention and frequency. CLAIRE evaluates incident relevance for triage purposes.
The auto-remediation of the health of data pipelines naturally lends itself to FinOps by controlling costs for moving data. Additional FinOps features involve controls for resource-level governance, a usage-based calculator, geo-proximity details, preset usage quotas, telemetry data, capacity-planning functions, and more. Predictive resource consumption analytics are planned for later this year.
Strengths: Informatica’s extensive data observability support for data pipelines and its visualizations and scores for data’s health are credible, and its FinOps capabilities are a nice bonus.
Challenges: A standalone data observability product accessible independent of Informatica’s burgeoning ecosystem may appeal to some users.
Kensu
Unlike most data observability offerings that connect to data sources, Kensu’s focus is to connect to data at the application level to monitor, troubleshoot, and effect continuous prevention of data incidents. It supplements its data validation measures with pipeline observability, and it installs via Docker to deploy in any cloud and on-premises. Customers can use both Kensu Core, a front-end web application, and Kensu Hub.
When solutions connected to Kensu run, they produce metadata that Kensu uses to generate metrics about the underlying data’s health rather than using the data. Kensu integrates agent codebases into application code to glean observability and data pipeline statistics. Users can also program rules. The platform creates ML recommendations about anomalies, outliers, and metrics based on the historical behavior of the application data (meaning data that applications are reading and creating).
This approach produces a couple of desirable outcomes. By providing ML recommendations at the application level, Kensu is informally “educating” those applications to generate metrics about their data rather than just about the application performance. This method also facilitates data observability on edge devices and in edge computing environments, which not all vendors can do. It even factors into Kensu’s ability to support streaming data and unstructured data in addition to structured and semi-structured data.
Users can augment the platform’s ML-based monitoring with business rules pertaining to SLAs and business logic.
The solution can inspect every application in a pipeline to determine where in the pipeline failure or anomalous behavior occurred and the corresponding downstream impacts. It does so by defining, monitoring, and accessing alerts for metrics pertaining to applications in pipelines. Anomalies result in notifications for tickets about specific incidents. Tickets or alerts detail data lineage, applications involved, specific rules or metrics invoked, and particular user-specified projects the anomaly affects. This information is useful for impact analysis. Data lineage reveals both downstream and upstream (root cause analysis) implications, can be filtered according to applications, and illustrates time series visualizations for both relevant rules and metrics. Kensu also performs root cause analysis on the actual data itself so that it’s not solely reliant on data lineage for this activity.
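Lineage-based impact analysis and root cause analysis amount to walking a dependency graph in opposite directions. The sketch below assumes a hypothetical lineage map (`LINEAGE`) with invented asset names; it is not Kensu’s data model or API.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its values.
LINEAGE = {
    "crm_export": ["clean_customers"],
    "web_events": ["sessions"],
    "clean_customers": ["customer_360"],
    "sessions": ["customer_360"],
    "customer_360": ["churn_dashboard", "churn_model"],
}


def downstream(graph, start):
    """Breadth-first walk over everything fed by `start` (impact analysis)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def invert(graph):
    """Reverse every edge so the same walk can run upstream."""
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return reversed_graph


def upstream(graph, start):
    """Everything that feeds `start` (root cause candidates)."""
    return downstream(invert(graph), start)


impacted = downstream(LINEAGE, "sessions")
suspects = upstream(LINEAGE, "customer_360")
```

An anomaly raised on `sessions` would list the dashboard and model as impacted, while an anomaly on `customer_360` would point back to its four upstream assets as places to investigate.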
Another core use case for the platform is to prevent future data health issues and to anticipate propagation. Once data incidents occur, notifications are issued, and remediation happens, users can create additional rules related to notifications for similar occurrences in the future. Examples include expanding Kensu’s detection for things like new, modified, and missing fields so alerts are generated when applications result in these mishaps.
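A rule that detects new, modified, and missing fields can be pictured as a comparison between an expected schema and the schema an application actually produced. The following is a minimal sketch; the function name `diff_schema` and the dictionary representation of a schema (field name mapped to type name) are assumptions, not part of Kensu’s product.

```python
def diff_schema(expected: dict, observed: dict) -> dict:
    """Compare two {field_name: type_name} mappings.

    Returns the fields that appeared, disappeared, or changed type.
    """
    shared = expected.keys() & observed.keys()
    return {
        "new": sorted(observed.keys() - expected.keys()),
        "missing": sorted(expected.keys() - observed.keys()),
        "modified": sorted(f for f in shared if expected[f] != observed[f]),
    }


# Hypothetical schemas before and after an application change.
expected_schema = {"id": "int", "email": "string", "signup_date": "date"}
observed_schema = {"id": "string", "email": "string", "plan": "string"}

changes = diff_schema(expected_schema, observed_schema)
should_alert = any(changes.values())
```

Here `id` changed type, `signup_date` disappeared, and `plan` is new, so an alert would be warranted.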
Kensu’s short-term roadmap includes an updated control center with dashboarding capabilities to provide self-service functionality for different actions within the platform.
Strengths: The assortment of use cases Kensu supports based on its application-level connections, including data of any structure variation, in edge environments, and in streaming deployments, is as broad as that of any offering in this space.
Challenges: Although Kensu can perform root cause analysis on data (as opposed to its data lineage), some organizations might find its focus on metadata-based monitoring (rather than data-based) to be a limitation.
Monte Carlo Data
Monte Carlo Data is one of the seminal data observability vendors. Its founders are largely responsible for popularizing the data observability term and parlaying it into a major data management subcategory.
This comprehensive solution employs various methods to detect, resolve, and prevent data downtime. It integrates with myriad source systems, data movement platforms, orchestration tools, and GitHub. The solution also features a low-code GUI that frequently pairs numerical and visualized data to illustrate the state of data’s health over time. A significant amount of automation is provided by self-supervised ML, while in-app integrations and data catalog integrations disseminate alerts. Monte Carlo also contains a catalog for metadata and data profiling metrics.
Monte Carlo Data collects metadata, logs, metrics, and aggregated statistics to monitor data and data pipelines. It never pulls data out of sources or pipelines to facilitate observability, which reinforces governance and security. The platform ingests metadata and metrics about data pipelines that are useful for optimizing costs and performance. When paired with core quality assessments of data, users can understand how pipelines are impacting data health. Integrating Monte Carlo Data into pipelines enables circuit breakers to stop data from flowing through pipelines, preventing downstream issues if quality standards aren’t met.
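The circuit breaker pattern described above can be sketched in a few lines of Python. This is a generic illustration, not Monte Carlo Data’s SDK: the `CircuitOpen` exception, the `null_rate` check, the 10% limit, and the `order_id` field are all hypothetical.

```python
class CircuitOpen(Exception):
    """Signals that a batch failed its checks and must not move downstream."""


def null_rate(rows, field):
    """Fraction of rows where `field` is missing or None (1.0 if no rows)."""
    if not rows:
        return 1.0
    return sum(row.get(field) is None for row in rows) / len(rows)


def circuit_breaker(rows, field, max_null_rate=0.1):
    """Return the batch unchanged, or raise if the quality standard is not met."""
    rate = null_rate(rows, field)
    if rate > max_null_rate:
        raise CircuitOpen(f"{field}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return rows


def run_step(rows, sink):
    """Load rows into `sink` only if the breaker stays closed."""
    try:
        sink.extend(circuit_breaker(rows, "order_id"))
        return True
    except CircuitOpen:
        return False


warehouse = []
good = [{"order_id": 1}, {"order_id": 2}]
bad = [{"order_id": None}, {"order_id": 3}]

loaded_good = run_step(good, warehouse)
loaded_bad = run_step(bad, warehouse)
```

The failing batch never reaches `warehouse`, which is the point of the pattern: the check sits inside the pipeline step rather than running after the data has landed.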
ML techniques analyze historical data patterns in source data to infer expected behavior for data and data pipelines. This creates out-of-the-box monitoring, anomaly detection, and metric derivation without coding or writing rules, though users may implement rules programmatically for these purposes. Monitoring is based on metadata and data; the system continuously pulls the former to detect changes, such as row count in tables, while users can query batches of data via SQL every time monitoring runs for more sophisticated analysis.
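To make the notion of inferring expected behavior from history concrete, the sketch below derives an acceptable range for a table’s row count from past observations. It uses a plain mean and standard deviation as a stand-in for the vendor’s ML models, which are not described in detail here; the function names, the `k=3.0` band, and the sample counts are assumptions.

```python
from statistics import mean, stdev


def expected_range(history, k=3.0):
    """Bounds of k sample standard deviations around the historical mean."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(history, latest, k=3.0):
    """True when the latest observation falls outside the learned range."""
    low, high = expected_range(history, k)
    return not (low <= latest <= high)


# Hypothetical daily row counts pulled from table metadata.
row_counts = [1000, 1020, 990, 1010, 980, 1005, 995]

normal_day = is_anomalous(row_counts, 1012)
partial_load = is_anomalous(row_counts, 400)
```

No rule was written by hand: the range comes entirely from the history, which is the property the out-of-the-box monitoring relies on.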
Metadata is essential for triaging incidents and providing at-a-glance details about impacted tables, specific queries running against them, and BI views. Table- and field-level lineage models detail data’s journey from sources to destinations to automatically pinpoint where incidents arose. Other features include side-by-side query log analysis to compare queries before and after incidents.
Strengths: The platform’s diverse compilation of source connections, low-code user interface, ML monitoring capabilities, strong data pipeline support, and data catalog make this an extremely well-rounded offering out of the gate.
Challenges: While Monte Carlo Data provides automated monitoring of metadata for data observability, SQL statements are required for more profound analysis, which some customers might find cumbersome.
Precisely
Precisely’s data observability capabilities are part of its larger product suite for reinforcing data integrity. In addition to data observability, the Data Integrity Suite includes tools for data quality, data integration, data governance, data enrichment, geo addressing, and spatial analytics. Its data observability functionality integrates with its well-known data catalog and a data flow designer; the latter can be used for data pipelines. A consolidated dashboard enables users to assess their data health across the organization.
A cloud-native offering, Precisely supports a number of data sources; some of the more commonly used include Databricks, BigQuery, Snowflake, Redshift, and Microsoft Synapse. The system offers data profiling to help users understand what their data is. Business metadata and technical information about assets are stored in the data catalog. This information can be enriched in the catalog with reference data, business logic, and additional metadata models, and it serves as a starting point for monitoring data to see if significant changes to it occur.
There are metrics to determine the popularity of specific assets, which are used to triage data management tasks accordingly. Anomaly detection involves the use of ML directly within the data observability module and rules-based validation mechanisms courtesy of the data quality features. This tandem provides continuous monitoring of data and aspects of data pipelines for up-to-date data health information.
A low-code user interface allows organizations to visualize trends, alerts, and data pipelines, the last of which assists with evaluating dependencies needed to perform root cause analysis and impact analysis. Data lineage views, some of which involve other services in Precisely’s Data Integrity Suite, reinforce upstream and downstream analysis of data incidents. There are also analytics functions to assess whether replicated data in targets is identical to source data. Alerts are automatically issued when AI techniques perceive outliers and indicators of data drift.
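Checking whether replicated data in a target is identical to its source can be approached by fingerprinting rows and comparing them by key. The following is a minimal sketch under that assumption; it does not describe how Precisely implements the check, and `row_fingerprint`, `compare_tables`, and the sample rows are hypothetical.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row, independent of column order."""
    canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_tables(source, target, key):
    """Report keys that are missing, unexpected, or different in the target."""
    src = {row[key]: row_fingerprint(row) for row in source}
    tgt = {row[key]: row_fingerprint(row) for row in target}
    shared = src.keys() & tgt.keys()
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in shared if src[k] != tgt[k]),
    }


# Hypothetical source rows and their replicated copies.
source_rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 3, "city": "Pune"},
]
target_rows = [
    {"city": "Oslo", "id": 1},
    {"id": 2, "city": "lima"},
    {"id": 4, "city": "Kyiv"},
]

report = compare_tables(source_rows, target_rows, key="id")
```

An empty report means the target matches the source for every key; any non-empty list identifies exactly which rows to investigate.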
Strengths: Precisely’s holistic approach to data observability is comprehensive enough to satisfy multiple enterprise use cases.
Challenges: The Data Observability service of the Precisely Data Integrity Suite could be improved with greater support for automation, data source connectivity, data quality rules, and metrics about datasets.
Sifflet
Sifflet’s data observability platform is predicated on counteracting “data entropy.” It comprises data cataloging, data lineage, and advanced data monitoring mechanisms to identify data health incidents, understand their impact, perform root cause analysis, plan, troubleshoot, and prevent recurrences.
Although the data catalog isn’t expressly designed for data governance, it offers many features of standalone catalogs. Once connected to source systems, the catalog scans them and begins populating itself with metadata, descriptions from transformation tools, and the health status of data. Common sources include Snowflake, dbt, Postgres, Airflow, Hive, Athena, and more. Supervised learning algorithms suggest tags that users can modify or augment with their own.
By analyzing the source data’s historical patterns, Sifflet’s solution automatically generates rules about the behavior of rows, columns, and tables when connecting to data sources. Those rules become the basis for detecting aberrations. ML models typically create rules for metrics like duplicates, schema changes, completeness, data freshness, and formatting. Other automated monitoring coverage applies to accuracy, consistency, validity, timeliness, and uniqueness. These ML models are updated via user feedback.
Organizations can also train the system’s ML for their own metrics to automatically generate additional rules. In addition to metrics, these rules are based on static thresholds and user selected benchmarks. Alternatively, users can author SQL rules. Sifflet has approximately 50 monitoring templates spanning rules and ML-based techniques that are applicable to both data and metadata. Alerts are generated for data that fails monitoring checks.
Connecting to source data automates aspects of data lineage, which Sifflet uses for upstream root cause analysis and downstream understanding of the business impact of data health incidents. This metadata includes field-level lineage, views of links between assets to understand dependencies, and details about transformations. The monitoring capabilities involve incident management in which tickets are automatically generated whenever rules about data’s health fail.
Data pipeline support includes integrations with orchestration tools like Airflow and Prefect to implement circuit breaker techniques that prevent data outside of established boundaries from continuing in pipelines. The solution has an API for monitoring data at every step in a pipeline.
Strengths: Sifflet’s data cataloging functionality is a point of competitive differentiation and a centralized place in which to store metadata descriptions, automatic and user-defined tags, and status checks about data. The automation characterizing the cataloging aspect of the solution, its generation of rules, and facets of its incident management is impressive.
Challenges: Sifflet’s ability to filter out data of insufficient health is not as extensive as some of its competitors’.
Soda
Soda was formed approximately four years ago to empower organizations to establish and share quality data. The platform comprises several components, including Soda Library for ensuring reliable data pipelines, Soda Cloud (a SaaS platform) for self-service and collaborative data quality, and Soda Agent, which secures connections between Soda Cloud and data sources while empowering users to define data quality checks. There are data contracts for formalizing checks and tests for specific data during this process. Soda Checks Language (SodaCL) is a declarative means of implementing business rules for checks; SodaGPT uses foundation models so organizations can create checks as code via natural language commands.
Soda can be implemented in pipeline tools like Airflow and in development settings like Git, and it can be inserted at the end of pipeline processes when data persists for event-based paradigms. Once data quality checks are conducted, results are sent to Soda Cloud, which disseminates notifications as needed. Soda’s automated approach to monitoring, alerting, and addressing data quality issues is multifaceted. The system uses ML time series anomaly detection to scrutinize historical data and monitor it for changes to common metrics like completeness and data currency.
The platform also assigns a data health score based on this initial analysis, and there are options for profiling and sampling data as well. Other automation includes using SodaGPT, an LLM approach to writing data quality checks in natural language that generates SodaCL (a YAML-based domain-specific language that prepares SQL queries for data quality checks) to implement rules in the underlying system.
Data contracts are SLAs for specifying source data, naming it accordingly, and establishing data integrity standards. Contracts provide an opportunity for users to formalize specific observability checks, rules, and business logic on top of the automated capabilities. Dashboards can be set up to monitor incidents, number of checks on the dataset, and the health score. When the system identifies variations from rules or historical patterns, alerts are issued via tools like Teams or Slack with diagnostic information about failed rows, historical records, and the current data health score.
Root cause analysis involves either the code base found in data pipelines or issues inside sources. Soda enables users to create incidents based on alerts, which can be incorporated into defined resolution workflows using tools like ServiceNow, Jira, and others. Circuit breaker techniques can stop pipelines from transmitting data that fails checks; it’s also possible to quarantine that data and enable healthy data to progress through pipelines. Soda Library embeds Soda within pipelines, where ML can suggest checks and test pipelines in CI/CD workflows for issues. Soda also lets engineers test data quality as part of their pipelines so when new data is produced, transformed, or integrated, it’s also tested and validated.
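Quarantining failed rows while letting healthy data continue can be sketched as a simple split of each batch. The example below is written in plain Python rather than SodaCL, and it is not Soda’s implementation; `split_batch`, the check names, and the sample rows are assumptions for illustration.

```python
def split_batch(rows, checks):
    """Separate rows that pass every check from rows that fail at least one.

    Quarantined rows keep the names of the checks they failed so that the
    diagnostic detail can travel with an alert or incident.
    """
    healthy, quarantined = [], []
    for row in rows:
        failed = [name for name, check in checks.items() if not check(row)]
        if failed:
            quarantined.append({"row": row, "failed_checks": failed})
        else:
            healthy.append(row)
    return healthy, quarantined


# Hypothetical row-level checks expressed as named predicates.
CHECKS = {
    "amount_is_positive": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
    "currency_is_known": lambda r: r.get("currency") in {"USD", "EUR"},
}

batch = [
    {"amount": 10, "currency": "USD"},
    {"amount": -5, "currency": "EUR"},
    {"amount": 7, "currency": "XXX"},
]

healthy, quarantined = split_batch(batch, CHECKS)
```

Unlike a circuit breaker, which halts the whole batch, this approach forwards the one valid row and holds back the two that failed, along with the reason for each.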
Soda integrates with popular data catalogs like Atlan, Collibra, Alation, and more. The data observability solution captures metadata about data ownership, which is then used to communicate with owners for reviewing quality checks and resolving issues when there are failures.
Strengths: Soda’s data pipeline support, natural language rules-writing capability, and other forms of ML are useful and practical, as are its integrations with data catalogs for alerting relevant parties about checks, incidents, and resolutions.
Challenges: Soda’s platform could be enhanced with more native support for data lineage, which is primarily facilitated through third-party data catalog integrations.
Telmai
Telmai’s data observability solution is based on a low-code/no-code framework that employs data science to predict different facets of data’s health. It has an API-based open architecture that automates data quality KPI monitoring and anomaly detection at the attribute level. Telmai leverages Apache Spark as its analytics and ML computation engine, which enables it to observe data at scale without sampling, previews, or inordinate reliance on metadata.
The platform’s data observability applies to structured, semi-structured, and streaming data, message queues, data warehouses, cloud storage, data lakes, and APIs. The platform includes, yet goes beyond, assessing data reliability according to metadata to provide column-level patterns and anomalies about specific data values.
Telmai automatically scans source data, extracts relevant metrics, and organizes them into automatic data quality KPIs (via semi-supervised, supervised, and unsupervised learning). The platform has approximately 40 out-of-the-box metrics that are applicable for any data; the system measures and monitors freshness, consistency, accuracy, completeness, uniqueness, and validity. It also provides high-level health metrics and inter-pipeline data lineage. Automatic data profiling at scale helps the system understand the data and establish quality metrics.
Human-in-the-loop approaches let organizations fine-tune ML-derived monitoring criteria. Alternatively, users can fine-tune monitoring criteria with interactive rule-building functionality to implement validation rules, business policies, and SLA requirements. Additionally, they can define calculated attributes and devise custom views for monitoring via SQL’s GROUP BY features and joins.
Telmai analyzes data quality issues in its own platform via Spark. Since it doesn’t push down queries to the underlying databases or data warehouses, it doesn’t create any operational overhead on those systems or increase the usage/license costs of them. Telmai supports change data capture (CDC), Delta architectures, and event streaming to compare updates in data with historical metrics and flag drifts and outliers. These predictive algorithms involve time series analysis to show changes over time. There are two categories of anomalies detected: those in metric drifts over time and those for data value/record outliers at any time in datasets.
Telmai integrates into data pipelines via REST APIs to orchestrate data observability through tools like dbt, Airflow, and others. There, it validates data and performs anomaly detection, which can be viewed through its UI or via notifications and alerts. Binning capabilities separate anomalous data that fails validations from reliable data. The system also supports circuit breaker capabilities and enables users to create support tickets when poor data is spotted.
Reconciliation features can resolve differences in data between source and target systems. Alerts for insufficient data health include visualizations and actual values. There’s also segmentation and historic trend analysis to help with root cause analysis and determine how to remediate incidents.
Strengths: The platform’s low-code functionality, considerable automation, and data pipeline support will go a long way with users, as will its attribute-value anomaly detection and support for open architecture.
Challenges: Native integrations with common pipeline and orchestration tools like dbt and Airflow would make this solution more attractive. Although its API supports integrations with several data catalogs, native integrations with an even greater array of them, or a native catalog addition to Telmai, could help put it at parity with its stiffest competitors.
Validio
Validio’s data observability platform is predicated on collaborative efforts between data teams and business users for the right blend of context to understand the state of the data’s health. The solution deploys through Kubernetes in the cloud and is also available as a hosted service that supports structured and semi-structured streaming data, data warehouses, data lakes, and data lakehouses. GCP, Azure, and AWS users can employ Validio for data sources outside of those three cloud providers.
Once Validio is connected to source data, it profiles the data, and ML models analyze the data’s historical state according to common data quality and data validation standards. This process trains the models in real time to establish thresholds for these metrics. The thresholds are applicable to metadata, actual data, and data points at a “segment” or table level, which is helpful for monitoring metrics pertaining to data drift, shifts in distributions, reference sources, schema changes, and more. Validio can also recommend out-of-the-box validation checks down to the column level, so monitoring these aspects of data health for variations from the established ranges can begin right away.
Users can also specify data segments for Validio’s predominantly unsupervised learning models to implement static and adaptive thresholds. With the latter, a data incident is created the first time data changes, an alert is issued, and users can address the incident as they see fit. However, after that initial change, the threshold adapts to encompass the change so that it’s no longer anomalous.
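The difference between static and adaptive thresholds can be shown with a small Python sketch. It models only the behavior described above (an incident on the first out-of-range value, after which the range absorbs that value); Validio’s thresholds are learned by ML models, so the simple widening logic and the class names here are assumptions.

```python
class StaticThreshold:
    """Fixed bounds: every out-of-range value raises an incident."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def observe(self, value) -> bool:
        """Return True when `value` should raise an incident."""
        return not (self.low <= value <= self.high)


class AdaptiveThreshold(StaticThreshold):
    """Bounds that widen to absorb a change after it has been reported once."""

    def observe(self, value) -> bool:
        if self.low <= value <= self.high:
            return False
        # First sighting of the change: report it, then adapt the bounds
        # so the same level is no longer treated as anomalous.
        self.low = min(self.low, value)
        self.high = max(self.high, value)
        return True


# Hypothetical metric that shifts from roughly 100 to roughly 150.
readings = [100, 150, 150, 140]

static = StaticThreshold(90, 110)
adaptive = AdaptiveThreshold(90, 110)

static_incidents = [static.observe(v) for v in readings]
adaptive_incidents = [adaptive.observe(v) for v in readings]
```

With these readings, the static threshold raises three incidents, while the adaptive threshold raises one and then treats the new level as normal.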
Alerts about data outside of thresholds are sent through channels like Jira, Slack, webhooks, Microsoft Teams, and APIs. The platform is navigable via infrastructure-as-code approaches for technical users and a GUI for business users. The GUI enables nontechnical users to connect to data sources and onboard datasets, after which the system pre-populates table names and recommendations about what to monitor in minutes. The GUI also offers a visual means of creating user-defined validators. Alternatively, Validio offers a command-line interface (CLI), API, and a Python SDK to code these functions.
The GUI’s dashboard provides an overview of the state of data across all sources, with stats about the overall data health and number of incidents. Users can home in on certain segments or sources as desired. Validio also presents time series views in its alerts and dashboard for each validator, so users can see current data points relating to thresholds for validation checks. These views also enable users to change the configuration of validators or the data segments where they’re applied. Anomalies or incidents are highlighted, and there are features for creating support tickets and alerting others.
The system’s root cause analysis involves anomaly detection and segmentation techniques. Anomaly detection invokes an ML model that pinpoints outliers according to trends and seasonality; evaluating outliers with segmentation techniques can reveal their cause. Validio facilitates observability in data pipelines via an API that communicates with the pipeline orchestrator to implement circuit breakers.
Strengths: This platform’s ease of use, facilitated by its quick connections to sources, recommended validators, and ML thresholds, is appealing.
Challenges: Increased native support for data observability in pipelines will improve this offering.
6. Analyst’s Take
The data observability domain is quickly evolving to keep pace with the ever-shifting data types, applications, and use cases for which organizations require optimal data health. Many of these platforms ease the process of implementation with connections to popular data sources and with mechanisms to fortify security and data governance concerns around regulatory compliance. The technical aspects of employing these products are supported and enhanced by a variety of characteristics that serve as key takeaways for potential users, including:
- Learning data’s normal state: Various techniques are responsible for automating the process of comprehending the normal state of data in specific systems, pipelines, and applications. Inference mechanisms and ML models accomplish this task for organizations as a baseline on which to predicate anomaly detection—which they also facilitate.
- End-to-end monitoring: One of the distinctions of data observability tools is that they monitor data’s health whether data is at rest or in motion in data pipelines. Organizations don’t have to wait for data to land to determine if values, tokens, or schema information in columns have drifted.
- Mainstays: The monitoring, triaging, alerting, and corrective actions that typify data observability platforms are applicable to aspects of metadata, data quality, data modeling, and schema. These capabilities rely on interactive visualizations, dashboards, and reporting for real-time updates about the state of data’s health.
- Data protection: The comprehensive nature of the monitoring prowess of these platforms is bolstered by robust security measures such as role-based access controls, identity and access management, and standards for complying with specific regulations like SOC 2.
- Root cause analysis: An artful combination of diagnostics and analytics, drill-down capabilities in dashboards, and data lineage enables these solutions to pinpoint where “data failure” occurred, to spur remediation efforts. The investigatory nature of accomplished tools in this discipline is a valuable time-saver, preventing lengthy querying and manual scrutiny of numerous individual systems in the process.
The specificity of data observability offerings for mitigating the occurrence of data drift and data downtime makes them more practical than addressing these issues with respective tools for data quality, testing, general observability, etc. Moreover, they enable these capabilities at enterprise scale in distributed settings across on-premises, hybrid, and multicloud environments. Decisions about which solution is the most appropriate for a particular organization should involve consultation between technical and business teams regarding SLAs, individual use cases, compliance features, and source connectivity.
7. About Andrew Brust
Andrew Brust has held developer, CTO, analyst, research director, and market strategist positions at organizations ranging from the City of New York and Cap Gemini to GigaOm and Datameer. He has worked with small, medium, and Fortune 1000 clients in numerous industries and with software companies ranging from small ISVs to large clients like Microsoft. The understanding of technology and the way customers use it that resulted from this experience makes his market and product analyses relevant, credible, and empathetic.
Andrew has tracked the Big Data and Analytics industry since its inception, as GigaOm’s Research Director and as ZDNet’s original blogger for Big Data and Analytics. Andrew co-chairs Visual Studio Live!, one of the nation’s longest-running developer conferences, and currently covers data and analytics for The New Stack and VentureBeat. As a seasoned technical author and speaker in the database field, Andrew understands today’s market in the context of its extensive enterprise underpinnings.
8. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.