
GigaOm Radar for Data Lakes and Lakehouses v1.0

1. Summary

Organizations need to manage large-scale data stored in different formats—structured, semi-structured, or unstructured—without relying on proprietary software, as data warehouses require. Data lakes allow organizations to store and query large amounts of data easily and with very little maintenance or imposed structure.

As a result, data lakes are compatible with many different file formats, including CSV (comma-separated values) and Parquet, as well as newer open table formats like Delta Lake and Iceberg. Additionally, many data lakes (and the query engines built to analyze the large-scale datasets within them) are built on underlying open source technology, support open file formats, and handle security and governance through integration with additional open source technologies, such as Apache Ranger and Atlas.

The past, present, and future of data lakes are intertwined with those of the data warehouse. Both solutions originated with attempts to find a single optimal solution to enterprise data management. Additionally, over the past year, the term “lakehouse” has moved from a novel, somewhat esoteric moniker into the mainstream. A lakehouse is a solution that attempts to blend capabilities of data warehouses and data lakes together. The blending is done by implementing query engine features that are designed to bring the optimizations and performance of a data warehouse to a data lake. Proponents of this architecture describe a lakehouse as an optimal blend of data lake and data warehouse approaches.

Today, there is a wide range of opinions, philosophies, and marketing biases within the industry regarding the relationship between data lakes and data warehouses. Some vendors, like Snowflake, are proponents of a data-warehouse-only approach. Others—like Microsoft, Google, and Oracle—provide users with the choice of some combination of a data lake, lakehouse, and/or data warehouse within the same product offering. Still others—like Databricks, Cloudera, and Dremio—stick to lakehouse offerings exclusively but emphasize that they are elegant hybrids of data lake and warehouse technology, obviating the need for a combination.

Regardless of the specific technology or label—lake, lakehouse, warehouse—the most important factor for organizations to focus on when selecting a product is the use case it must address. To that end, this report aims to assist organizations in their decision-making process, to help them select the solution that best suits their needs.

This GigaOm Radar report highlights key data lake and lakehouse vendors and equips IT decision-makers with the information needed to select the best fit for their business and use case requirements. In the corresponding GigaOm report “Key Criteria for Evaluating Data Lake and Lakehouse Solutions,” we describe in more detail the capabilities and metrics that are used to evaluate vendors in this market.

How to Read this Report

This GigaOm report is one of a series of documents that helps IT organizations assess competing solutions in the context of well-defined features and criteria. For a fuller understanding, consider reviewing the following reports:

Key Criteria report: A detailed market sector analysis that assesses the impact that key product features and criteria have on top-line solution characteristics—such as scalability, performance, and TCO—that drive purchase decisions.

GigaOm Radar report: A forward-looking analysis that plots the relative value and progression of vendor solutions along multiple axes based on strategy and execution. The Radar report includes a breakdown of each vendor’s offering in the sector.

Solution Profile: An in-depth vendor analysis that builds on the framework developed in the Key Criteria and Radar reports to assess a company’s engagement within a technology sector. This analysis includes forward-looking guidance around both strategy and product.

2. Market Categories and User Segments

For a better understanding of the market and vendor positioning (Table 1), we assess how well data lake and lakehouse solutions are positioned to serve specific market categories and user segments.

For this report, we recognize three market categories:

  • Small-to-medium business (SMB): In this category, we assess solutions on their ability to meet the needs of organizations ranging from small businesses to medium-sized companies. Also assessed are departmental use cases in large enterprises, where ease of use and deployment are more important than extensive management functionality, data mobility, and feature set.
  • Large enterprise: Here, offerings are assessed on their ability to support large and business-critical projects. Optimal solutions in this category have a strong focus on flexibility, performance, data services, and features that improve security and data protection. Scalability and the ability to deploy the same service in different environments are big differentiators.
  • Specialized: Optimal solutions are designed for specific workloads and use cases, such as big data analytics and high-performance computing (HPC).

In addition, we recognize four user segments for solutions in this report:

  • Business user: Business users are typically beginners in the realm of data and analytics. While these employees may occasionally need to use analytical tools to perform self-service exploration and analysis, they rely on others to handle the technical aspects of configuring and provisioning them.
  • Business analyst: These users have some knowledge of data analysis tasks and are familiar with using self-service tools to perform analytics. They evaluate data from the perspective of deriving business insights and making recommendations for improvements, such as better performance or cost reduction.
  • Data analyst: These users review data to look for trends and patterns that can benefit organizations at the corporate level. While not as technical as data engineers, data analysts possess knowledge of data preparation, visualization, and analysis that can be applied to inform organizational strategy.
  • Data engineer: Data engineers are very well versed technically and apply their specialized knowledge to help prepare, organize, and model data, transforming it into actionable information for the organizations they support.

Table 1. Vendor Positioning

Market segments: SMB, Large Enterprise, Specialized
User segments: Business User, Business Analyst, Data Analyst, Data Engineer

Vendors evaluated: Ahana, Alluxio, AWS, Cloudera, Databricks, Dremio, Google, HPE, IBM, Microsoft, Oracle, Starburst
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

3. Key Criteria Comparison

Building on the findings from the GigaOm report, “Key Criteria for Evaluating Data Lake and Lakehouse Solutions,” Table 2 summarizes how each vendor included in this research performs in the areas we consider differentiating and critical in this sector. Table 3 follows this summary with insight into each product’s evaluation metrics—the top-line characteristics that define the impact each will have on the organization.

The objective is to give the reader a snapshot of the technical capabilities of available solutions, define the perimeter of the market landscape, and gauge the potential impact on the business.

Table 2. Key Criteria Comparison

Key criteria, scored 0-3 (see legend below): QF = Query Federation, MS = Managed Services, SO = Serverless Operation, UCP = Unified Control Plane, MC = Multicluster Structure, OSI = Open Source Technology Integration, LWU = Lake & Warehouse Unification

             QF   MS   SO   UCP  MC   OSI  LWU
Ahana         2    3    0    2    2    3    2
Alluxio       3    0    0    2    3    3    0
AWS           3    2    3    2    0    2    2
Cloudera      3    3    0    2    2    3    3
Databricks    3    3    0    2    3    2    3
Dremio        3    3    0    2    3    2    3
Google        3    3    3    2    2    2    3
HPE           3    2    0    3    3    3    2
IBM           2    2    3    3    0    2    1
Microsoft     2    3    3    2    0    2    3
Oracle        2    3    3    3    0    2    2
Starburst     3    3    2    2    3    3    3
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

Table 3. Evaluation Metrics Comparison

Evaluation metrics, scored 0-3 (see legend below): SE = Scalability & Elasticity, DF = Deployment Flexibility, HA = High Availability, PL = Platform Longevity, EU = Ease of Use & User-Friendliness

             SE   DF   HA   PL   EU
Ahana         3    2    3    1    3
Alluxio       2    3    2    2    1
AWS           3    1    3    2    3
Cloudera      3    2    3    3    3
Databricks    3    2    3    3    3
Dremio        3    3    2    3    3
Google        3    1    3    2    3
HPE           3    2    2    3    2
IBM           3    2    3    3    3
Microsoft     3    1    3    3    3
Oracle        3    2    3    3    3
Starburst     2    3    3    2    3
3 Exceptional: Outstanding focus and execution
2 Capable: Good but with room for improvement
1 Limited: Lacking in execution and use cases
0 Not applicable or absent

By combining the information provided in the tables above, the reader can develop a clear understanding of the technical solutions available in the market.

4. GigaOm Radar

This report synthesizes the analysis of key criteria and their impact on evaluation metrics to inform the GigaOm Radar graphic in Figure 1. The resulting chart is a forward-looking perspective on all the vendors in this report, based on their products’ technical capabilities and feature sets.

The GigaOm Radar plots vendor solutions across a series of concentric rings, with those set closer to the center judged to be of higher overall value. The chart characterizes each vendor on two axes—balancing Maturity versus Innovation, and Feature Play versus Platform Play—while providing an arrow that projects each solution’s evolution over the coming 12 to 18 months.

Figure 1. GigaOm Radar for Data Lakes and Lakehouses

As you can see in the Radar chart in Figure 1, six vendors are positioned in the Innovation half of the Radar and six are found in the Maturity half. This reflects the fact that while stable players with refined platforms exist, the data lake and lakehouse category is by no means stagnant, and vendors continue to fine-tune their philosophies and approaches to the overall lake/warehouse unification trend.

The majority of vendors are situated in the Leaders Circle or poised to enter it in the near term and have Fast Mover or Outperformer arrows. This shows that vendors are continuously refining their existing offerings and introducing new developments, and we expect them to continue at that pace over the next 12 to 18 months.

The majority of vendors are on the Platform Play side of the Radar, demonstrating the broader scope of these solutions, with many tools and solutions that are relevant to users in this space. Vendors across the industry are continuously introducing new features as they either solidify their positions on the spectrum or widen their platform to give users their choice of lake, lakehouse, warehouse, mesh, or something else.

Cloudera, Databricks, and Oracle are Leaders in the Maturity half of the radar, attesting to their longstanding presence within the industry and their well-refined, robust platforms. Cloudera added support for Iceberg tables to its platform in 2022, furthering support for the lakehouse paradigm. Databricks announced in 2022 the general availability of its Photon engine, which it developed to optimize and enhance query performance on its platform across all three major public clouds—Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

In the Innovation half of the Radar, Dremio has also continued to refine its Sonar query engine, which is based on Apache Arrow and includes the company’s own optimizations, such as the Dremio Columnar Cloud Cache (C3) and Data Reflections. In addition to the lakehouse model, the vendor now states its platform can also handle data mesh architectures. Starburst, an Outperformer in the Leaders Circle, has also grown its platform and community and expanded to handle data lake, lakehouse, and data mesh models as well.

Inside the GigaOm Radar

The GigaOm Radar weighs each vendor’s execution, roadmap, and ability to innovate to plot solutions along two axes, each set as opposing pairs. On the Y axis, Maturity recognizes solution stability, strength of ecosystem, and a conservative stance, while Innovation highlights technical innovation and a more aggressive approach. On the X axis, Feature Play connotes a narrow focus on niche or cutting-edge functionality, while Platform Play displays a broader platform focus and commitment to a comprehensive feature set.

The closer to center a solution sits, the better its execution and value, with top performers occupying the inner Leaders circle. The centermost circle is almost always empty, reserved for highly mature and consolidated markets that lack space for further innovation.

The GigaOm Radar offers a forward-looking assessment, plotting the current and projected position of each solution over a 12- to 18-month window. Arrows indicate travel based on strategy and pace of innovation, with vendors designated as Forward Movers, Fast Movers, or Outperformers based on their rate of progression.

Note that the Radar excludes vendor market share as a metric. The focus is on forward-looking analysis that emphasizes the value of innovation and differentiation over incumbent market position.

5. Vendor Insights

Ahana

Founded in 2020, Ahana offers a cloud-native, fully managed software as a service (SaaS) solution for deploying the Presto Engine on AWS, called Ahana Cloud for Presto. The Presto query engine, also known as PrestoDB, began as a project at Facebook (now Meta) to perform analytics on petabytes of data. From these roots, Presto has evolved into an open source, distributed SQL query engine designed to query large-scale datasets, across both relational and non-relational data sources. Presto is hosted and governed by the Linux Foundation’s Presto Foundation.

Ahana Cloud for Presto is structured with the Ahana SaaS Console, the Ahana Compute Plane, and the Presto clusters. The Ahana SaaS Console functions as a unified control plane from which the Presto clusters are created, overseen, and managed. It includes a web interface where users access and view options to spin up, tear down, and manage clusters. This control plane is layered over the Presto clusters, wherein reside metadata, monitoring, and data sources. The control plane is deployed across multiple AWS availability zones to provide high availability and deployment in a secondary region for disaster recovery.

As one of its main value propositions, the vendor says that Ahana Cloud for Presto simplifies the installation, configuration, and management of the Presto engine on a company’s data lake. Ahana Cloud automatically tunes Presto’s parameters out of the box, so clusters deliver solid query performance as soon as they are spun up. An included Apache Superset sandbox lets administrators validate connecting to, querying, and visualizing data.

Users can stop and restart clusters, change the number of Presto workers used in the cluster, and add or remove data sources as well. The Ahana user interface (UI) provides connectors for data sources including AWS Glue data catalog, the Hive metastore, Amazon OpenSearch (formerly Elasticsearch), Amazon RDS, and Amazon Redshift, as well as the ability to define a connector configuration file to connect to any supported Presto connector. Connection to BI tools such as Tableau and Looker is available through open database connectivity (ODBC) and Java database connectivity (JDBC) drivers. Ahana Cloud for Presto also includes basic role-based access controls (RBACs) that restrict access to data residing in object stores, depending upon the user’s role. An integration with Apache Ranger provides fine-grained access controls, authorization, and auditing capabilities.
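Because Ahana exposes a standard Presto endpoint, any Presto-compatible client can query a running cluster in addition to the ODBC/JDBC paths noted above. The following is a minimal sketch using the open source presto-python-client; the hostname, catalog, schema, and table names are hypothetical.

```python
import prestodb

# Hypothetical Ahana-managed Presto coordinator endpoint and catalog names.
conn = prestodb.dbapi.connect(
    host="presto-cluster.example.ahana.cloud",
    port=8080,
    user="analyst",
    catalog="hive",      # e.g., a Glue- or Hive-metastore-backed catalog
    schema="weblogs",
)

cur = conn.cursor()
# A standard ANSI SQL query runs against data in the S3-based data lake.
cur.execute("SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10")
for row in cur.fetchall():
    print(row)
```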

Strengths: Ahana’s strengths lie in its unified control plane, managed services, high availability, elasticity, connections to other AWS services, and the extended security and access controls of its platform that come from its integration with Apache Ranger.

Challenges: This platform is deployable only on AWS, presenting possible data egress expenses for customers with data in other clouds and making it a suboptimal fit for companies without existing AWS data lakes or other investments.

Alluxio

Founded in 2015, Alluxio is based on the open source project of the same name, which began at the University of California (UC) Berkeley’s AMPLab under the name Tachyon to provide an abstraction layer over data stored across many different storage systems. Alluxio sits between a variety of interfaces and APIs on one side and data in storage on the other. It provides a single point of access to multiple data sources, presenting a unified view of the underlying data and a standard interface for the applications accessing it.

Alluxio splits storage into two categories. The first, the Under File Storage (UFS), refers to an external file system, such as HDFS or Amazon S3. Alluxio can connect to one or more of these and provide access to them under a single namespace, a metadata concept that allows applications to access multiple independent storage systems through the same interface. Alluxio handles the connections to the different underlying storage systems instead of requiring the application to communicate with each individual storage system.

The second category is Alluxio storage, which makes use of caching to improve performance. Alluxio storage is focused on storing hot, transient data rather than on long-term persistence. Data is stored in memory, solid-state drive (SSD), non-volatile memory express (NVMe), or hard-disk drive (HDD) and can be replicated to make data more available for input/output (I/O) operations.

Users can configure the amount and type of storage for each Alluxio worker node. The worker nodes perform the data operations on data in underlying storage systems, and the data read from the underlying storage system is stored in memory by the workers to serve client requests. Alluxio works with compute engines, including Apache Spark, MapReduce, Flink, Hive, Presto, Trino, and TensorFlow. Client APIs include a Java API, S3 API, REST API, and POSIX API.
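To illustrate how a compute engine reaches data through Alluxio, the following is a minimal PySpark sketch. It assumes the Alluxio client library is on the Spark classpath and that an Alluxio master is reachable at the hypothetical address shown; the dataset path is also hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the Alluxio client jar is available to Spark and an Alluxio master
# runs at alluxio-master:19998 (19998 is the common default RPC port).
spark = SparkSession.builder.appName("alluxio-read-example").getOrCreate()

# Reading through the alluxio:// scheme lets Spark hit Alluxio's cache instead
# of going directly to the underlying store (for example, S3 or HDFS).
df = spark.read.parquet("alluxio://alluxio-master:19998/datasets/events/")
df.groupBy("event_type").count().show()
```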

Alluxio includes audit logging capabilities to allow system administrators to track users’ access to file metadata. Access control capabilities allow administrators to grant role-based permissions. Alluxio can be deployed on-premises or in any of the three major public clouds.

Strengths: Alluxio’s strengths include its unified control plane, open source technology integration, and optimizations for concurrency and multicluster structure.

Challenges: Alluxio doesn’t provide a managed services offering, so this would not suit an organization without a qualified data administrator or sufficient data engineering talent with relevant skills.

Amazon Web Services (AWS)

Amazon Athena is a serverless, interactive analytics service for analyzing data stored in various data sources, using SQL or Python. The vendor says Athena is designed to streamline querying data directly in S3 without loading it into Athena or managing infrastructure. It describes optimal use cases as ad-hoc SQL queries on data in S3 and more than 25 other data sources, and touts Athena’s ease of setup and use.

Athena includes auto-scaling capabilities and large-scale query execution, which allows it to run queries against large datasets at speed. Athena is serverless, leaving customers with no infrastructure setup or cluster configuration tasks to deal with on their own. The Athena console functions as a unified control plane from which users can define a schema for their data and use the built-in query editor to query it.

Amazon Athena for SQL makes use of the Presto open source distributed SQL query engine for querying data in place in Amazon S3, or in other sources using Athena Federated Query and a variety of prebuilt connectors. Queries can be run across data stored in relational, non-relational, object, and custom data sources running on-premises or in the cloud. Athena’s query federation software development kit (SDK) allows users to build connectors to any data source.

Interestingly, as an alternative to Presto, Athena supports data analytics and exploration with Apache Spark. This offering is also serverless and accessible via Python code in a Jupyter-compatible notebook experience within the Athena console.

Athena supports querying a wide variety of data formats, such as CSV, JSON, ORC, Avro, and Parquet. Geospatial data types, Iceberg tables, and Apache Hudi datasets are also supported. Because Athena charges per query based on the amount of data scanned, compressed columnar file formats like ORC and Parquet will be more cost-effective than text-based formats like CSV and JSON. ODBC/JDBC drivers allow Athena to connect to a wide variety of BI tools for querying and visualization. For high availability, Athena uses compute resources across multiple facilities (ostensibly, AWS Availability Zones), automatically rerouting queries if a particular resource becomes unavailable.
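Because Athena is driven entirely by APIs and SQL, a short script is often all that is needed to submit a query and retrieve results. The sketch below uses the boto3 Athena client; the database, table, and S3 results bucket are hypothetical, and the query is billed under the pay-per-data-scanned model described above.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical Glue/Athena database, table, and results bucket.
qid = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes; Athena charges on data scanned, not runtime.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```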

Integration with the AWS Glue Data Catalog provides the ability to create a unified metadata repository. Security capabilities include basic RBACs, as well as integration with AWS Identity and Access management for policy management and the ability to set fine-grained permissions. Athena also supports querying encrypted data and writing encrypted results back to S3 storage, with support for both server- and client-side encryption. Further functionality provided by elements of the AWS stack include the ability for users to invoke Amazon SageMaker machine learning (ML) models in an Athena SQL query to run inference and integration with Amazon QuickSight, the vendor’s BI service, for data visualization and reporting.

Strengths: Athena’s strengths include the ease-of-use derived from its hands-off serverless approach, as well as scalability and elasticity, federated querying, and high availability.

Challenges: Athena is deployable only on AWS infrastructure, and several of its extended capabilities are supplied by other components of the AWS stack (though the platform can query data on-premises and in other clouds), making it less suitable for organizations without Amazon S3-based data lakes or otherwise lacking adoption of the AWS cloud.

Cloudera

Cloudera has multiple technologies and integrations that provide value to users in the lake, lakehouse, and warehouse arena. Cloudera Data Platform (CDP) is the vendor’s flagship offering, which includes solutions for multiple use cases. Cloudera announced last year the general availability of the Apache Iceberg format as a native data table format integrated into its platform. Cloudera says this integration enables its data engineering, data streaming, and ML services and use cases, and that it transforms its data platform into an open data lakehouse.

Leaving Iceberg aside for the moment, we’ll point out that the Cloudera data lakehouse story results from the combination of several technologies, chief among them Apache Hive, Apache Impala, and Apache Spark. Apache Hive was originally designed to enable batch processing workloads, as well as reading, writing, and managing large datasets on Hadoop, all using SQL. Subsequent versions of Hive have increasingly enabled business intelligence-style query and analysis. Impala is an open source massively parallel processing (MPP) SQL query engine, originally also designed for Hadoop data lakes, but bypassing Hadoop’s batch processing engine. Impala uses the same unified storage platform, metadata, SQL syntax, ODBC driver, and UI as Hive, intertwining both tools into a unified solution for batch and interactive queries.

Using the Hive query language (HiveQL), a dialect of SQL, the original versions of Hive converted queries into a series of jobs that executed on a Hadoop cluster. The more modern version, Hive LLAP (alternately either “Low-Latency Analytical Processing” or “Live Long and Process”) follows a hybrid execution model consisting of a long-lived daemon that uses caching, pre-fetching, some query processing, access control, and parallel execution for multiple query fragments from different queries and sessions, to enable analytical HiveQL queries. Within the Cloudera platform, the company says Hive benefits from unified resource management, simplified deployment and administration, and shared security and governance to meet requirements for policy and regulatory compliance.

Impala provides SQL querying capabilities directly on data stored in Hadoop distributed file systems (HDFS), HBase, Amazon S3, ADLS, GCS, and Ozone. Impala possesses a specialized MPP distributed query engine for fast performance and reduced latency. The engine possesses a multicluster structure with auto-scaling and high availability (for load balancing and failover). Metadata management functionality is provided through a component known as the Impala Catalog Service, which relays the metadata changes from Impala SQL statements to all the Impala nodes (via daemons).

Impala is designed to facilitate BI and analytics on Hadoop as well as open file formats in object storage, and it includes dedicated connectors to integrate with BI tools such as Tableau and Power BI. Impala includes fine-grained access controls and user authentication through the Kerberos subsystem, as well as an auditing capability that detects which operations were attempted and whether they succeeded, which is particularly useful in tracking unauthorized access attempts.

CDP also includes an analytics layer that it calls Unified Analytics, which includes semantics commonality for SQL, query results caching, DataSketches functions and rewrites, and other optimizations for Hive and Impala.

The Hive Metastore (HMS) is a repository of metadata for Hive tables and partitions in a relational database and provides clients (including Hive, Impala, and Spark) access to this information using the metastore service API. HMS acts as a central schema repository.

Layered atop all this engine and metastore technology is the aforementioned Iceberg open table format, which brings some relational database properties to the data lake. Iceberg serves as a native table format across the entire CDP, supporting atomicity, consistency, isolation, and durability (ACID) compliance for reads and writes, schema and partition evolution, and time travel. An “alter table” statement migrates Hive-managed tables in-place to Iceberg tables by rewriting metadata without regenerating data files. Iceberg implements this functionality over the underlying Parquet, ORC, and Avro open file formats, which provide the baseline file storage and partitioning.
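As a rough illustration of that in-place migration, the sketch below issues a statement through the open source impyla client against an Impala coordinator. The hostname and table names are hypothetical, and the exact migration statement differs by engine and CDP version, so treat this as illustrative rather than definitive.

```python
from impala.dbapi import connect

# Hypothetical Impala coordinator; 21050 is a common default HiveServer2-protocol port.
conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# One documented in-place migration path: convert an existing Hive-managed table
# to the Iceberg table format by rewriting metadata, not the data files.
# (The statement shown is an assumption of one supported form; check the CDP docs.)
cur.execute("ALTER TABLE sales.orders CONVERT TO ICEBERG")

# The migrated table is then queried like any other table.
cur.execute("SELECT COUNT(*) FROM sales.orders")
print(cur.fetchall())
```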

CDP combines Hive, Impala, Spark, NiFi, and Flink with the Iceberg table format, integrating them with Cloudera’s Shared Data Experience (SDX) control plane and the underlying Apache Ranger and Atlas access control and governance technologies used by SDX, to deliver a full open data lakehouse platform. The addition of Cloudera Data Visualization adds BI capabilities as well for customers who don’t already have a BI platform.

The underlying Apache Hive, Impala, NiFi, Ranger, and Atlas open source components were developed or modernized chiefly by Cloudera, Hortonworks, and XA Secure. Hortonworks acquired XA Secure in 2014, and Cloudera acquired Hortonworks at the beginning of 2019. Apache Iceberg was not developed at Cloudera but is typically used with Apache Parquet, which was initially developed by Cloudera and Twitter. Most data lake platforms are based on at least a subset of these technologies, but Cloudera lays claim to their development and/or modernization, making it a major data lakehouse contender from before the time the term was coined.

That makes CDP a strong choice for a lakehouse platform, even though the company has not explicitly marketed it as such until recently.

Strengths: Cloudera’s strengths include its multiple analytics engines, native Iceberg support, federated querying, advanced security features through integration with Ranger, and high availability.

Challenges: Cloudera’s capabilities are available as embedded functionalities within the CDP, so this offering might not be the best fit for customers with significant investments in other services.

Databricks

Databricks is one of the main proponents of the lakehouse term and paradigm. To this end, it offers the Databricks Lakehouse Platform, which it describes as bringing together the best of both the data warehouse and data lake worlds into one unified platform.

The Databricks Lakehouse Platform is designed to blend the functions of data warehouse and data lake together to enable BI and ML workloads on an organization’s data. It consists of a mixture of open source and proprietary components, including an optimized version of Apache Spark, the proprietary Photon query engine, the Delta Lake open source table format, and the proprietary Unity Catalog for data governance, data sharing, auditing, and logging.

Databricks uses Delta Lake as the default storage format for all operations on its platform. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake was originally developed by Databricks, and the company continues to actively contribute to the open source project. Many of the optimizations and products in the Databricks Lakehouse Platform build upon the technologies in Apache Spark (created at UC Berkeley by Databricks’ founders) and Delta Lake.
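Because Delta Lake is the platform’s default table format, working with it from Spark requires no special configuration on Databricks. The following is a minimal PySpark sketch, assuming a Databricks (or otherwise Delta-enabled Spark) environment; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession is provided and Delta Lake is the default format.
spark = SparkSession.builder.getOrCreate()

# Land raw JSON files from cloud storage into an ACID-backed Delta table.
events = spark.read.json("/mnt/raw/events/")
events.write.format("delta").mode("append").save("/mnt/lake/events")

# The same Delta table can be read back (or registered) by any Delta-aware engine.
delta_df = spark.read.format("delta").load("/mnt/lake/events")
delta_df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```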

In August 2022, Databricks announced the general availability of its Photon query engine, a drop-in replacement for Spark SQL for its platform across all three major clouds. Photon boosts query performance through vector processing and a query optimizer. Photon dynamically modifies query planning at runtime, breaking query plans up into units called tasks that are run on worker nodes, which operate on specific partitions of the data.

The Unity Catalog is Databricks’ data governance solution for its lakehouse. It provides a central place from which admins can manage users and user access across all of an organization’s Databricks workspaces. Unity Catalog provides RBACs, identity management, data lineage, and audit log capabilities for the Databricks Lakehouse.

Databricks SQL warehouses, formerly named SQL endpoints, are dedicated Photon-based compute clusters that allow users to run SQL commands on data objects within the Databricks environment. They are structured as clusters dedicated to on-demand workloads and are isolated for increased concurrency. Users can specify the cluster size by selecting from a list of predetermined sizes (t-shirt sizing). The sizes correspond to a predetermined coordinator size and number of workers. Users can also select other parameters, including whether an idle cluster should stop and, if so, after how long; other options include multicluster load balancing.

By combining Spark, Photon, Delta Lake, cloud storage, and on-demand cluster provisioning, Databricks offers a data lakehouse platform that is fully integrated with the Spark- and notebook-based cloud data engineering, data science, and ML workload facilitation it was initially known for. The concept of teaming data warehouse-like query technology with data stored in open formats, accessible by multiple engines for multiple workloads, is one for which Databricks was the original champion and to which it was the first to apply the data lakehouse term. Though that incumbency no longer grants Databricks exclusivity in the space, it does indicate a great deal of credibility and thought leadership. It also represents significant investment, which should not be underestimated.

Strengths: Databricks’ strengths include its query federation, managed services offering, unified control plane, query acceleration, multicluster structure, and integrations with open source technologies.

Challenges: While Databricks’ platform is deployable as a managed service in any of the three major public clouds, it doesn’t offer any on-premises option at this time.

Dremio

Founded in 2015, Dremio is also a strong proponent of the data lakehouse term and paradigm. Dremio’s platform is a service that sits between data lake storage and end users, and it allows them to directly query data stored in the lake to perform interactive analytics and generate BI dashboards and visualizations. Dremio’s platform consists of two key services: Dremio Sonar and Arctic. Dremio Sonar is a query engine that provides powerful query capabilities on the data lake and a self-service semantic layer. Dremio Arctic is a lakehouse management service that consists of a lakehouse catalog with data-as-code functionality and security and governance capabilities, as well as an upcoming data optimization service that automates Iceberg table maintenance operations.

Dremio’s Sonar query engine is a proprietary columnar query engine with several optimizations for query acceleration, including what the company calls the Dremio Columnar Cloud Cache (C3) and Data Reflections. C3 selectively caches data through individual microblocks within datasets, with the goal of eliminating I/O costs and improving performance. Data reflections, the company says, are optimized materializations of source data or queries, derived from existing tables or views, that precompute aggregations and perform other operations. Dremio’s cost-based query optimizer uses data reflections to accelerate queries by transparently substituting them into the query plan and using them to satisfy all or part of a query, rather than having to fully process the raw data in the underlying data source. Transparent query acceleration enables end users to run BI workloads solely by working with their logical view of data, without having to manage or adjust physical structures such as materialized views and indexes to improve dashboard performance.

Sonar makes use of Apache Arrow, an open source project that was developed by Dremio and others to provide a unified format for in-memory representation of columnar data, eliminating the need for data to be serialized and deserialized among multiple formats. By leveraging Arrow Gandiva, a low-level-virtual-machine (LLVM)-based analytical expression compiler for Apache Arrow, Dremio compiles queries to hardware-optimized native code, maximizing CPU utilization with vectorized execution directly on Arrow buffers. Arrow Flight, an open source remote procedure call framework, enables fast data transfer between data systems, as an alternative to JDBC/ODBC connectors.
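As an illustration of the Arrow Flight path described above, the following is a minimal sketch using the pyarrow.flight client. The coordinator hostname, credentials, and table names are hypothetical, and 32010 is assumed as the default Dremio Flight port; deployments vary.

```python
from pyarrow import flight

# Hypothetical Dremio coordinator endpoint and credentials.
client = flight.FlightClient("grpc+tcp://dremio-coordinator.example.com:32010")
token = client.authenticate_basic_token("analyst", "password")
options = flight.FlightCallOptions(headers=[token])

# Results stream back as Arrow record batches, avoiding ODBC/JDBC row-by-row
# serialization overhead.
query = "SELECT region, SUM(amount) AS total FROM lake.sales GROUP BY region"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())
```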

Dremio’s optimizations for concurrency include the ability to provision multiple right-sized, physically isolated engines (clusters, essentially) to enable workload isolation and high concurrency. In an AWS deployment, engines are made up of one or more EC2 instances and are automatically started and stopped by the Sonar control plane. Queries are assigned to different engines based on routing rules or user selection, and all engines are physically isolated so concurrent workloads don’t impact or interrupt each other. Supported data lake file and table formats include CSV, JSON, ORC, Parquet, Delta Lake, and Iceberg. Dremio’s query engine also supports analyzing and joining data residing in various non-lake sources, including relational databases and NoSQL sources. Dremio’s connectors to external sources benefit from Dremio’s query acceleration, as well as additional optimizations including query pushdown (such as predicate and projection pushdown) and runtime filtering.

Dremio Sonar’s semantic layer enables self-service analytics for business users and centralizes access controls and governance for administrators. Data administrators can use the semantic layer to define a consistent and secure view of data and business metrics that can be leveraged by any downstream application. In addition, the semantic layer provides a self-service approach for non-technical users to discover, analyze, curate, and share datasets through SQL or a point-and-click UI. Native connections with BI tools, including Tableau and Power BI, provide analysis and enable users to generate BI dashboards directly from the data in the data lake. Security features include granular access management, audit logging capabilities, authentication, and integration with common identity providers—like Azure Active Directory, Okta, and Google—for additional fine-grained access controls and data encryption.

Dremio Arctic is a lakehouse management service within Dremio Cloud, powered by the Nessie open source project, and it is designed to provide a cloud-based alternative to the Hive metastore. The vendor says the key features of Dremio Arctic are its lakehouse catalog, which uses data-as-code capabilities to simplify data management and deliver a consistent view of the data to all end users; support for any processing engine, including Dremio Sonar, Flink, Presto, and Spark; and upcoming data management automation capabilities, including table optimization and table cleanup.

Dremio can be deployed in the AWS cloud as a SaaS offering called Dremio Cloud (with support for Azure coming soon). Dremio can also be deployed in GCP, Azure, or on-premises as a customer-managed solution.

The platform architecture for Dremio Cloud (Dremio’s SaaS offering) consists of the Dremio control plane, hosted in Dremio Cloud and functioning as a single pane of glass from which queries are planned and engines are managed, and the execution plane, hosted in the customer’s virtual private cloud (VPC), where data is stored and where compute engines that are responsible for query execution reside. If multiple cloud accounts are used with Dremio Cloud, each VPC acts as an execution plane.

Strengths: Dremio’s strengths include its self-service semantic layer, query acceleration, support for Iceberg and Delta Lake file formats, multicluster structure and optimizations for concurrency, and advanced security and governance features.

Challenges: Dremio Cloud is available only on the AWS Cloud at the moment. However, the company indicates that planned support for Azure is forthcoming.

Google

Google offers a number of services that enable customers to build a data lake or lakehouse that integrates with an organization’s existing applications, infrastructure, and investments. For this evaluation, we looked at:

  • Google BigQuery, which can be configured for data lake or lakehouse use cases (including the new Google BigLake storage engine and tables that enable unification of data lakes and data warehouses).
  • Dataproc, a managed Apache Spark and Hadoop service (with access to other open source engines including Presto and Apache Flink) for optimizing batch processing, querying, streaming, and ML.
  • Dataplex, a recently released solution that the vendor says helps customers address data fabric and data management needs and build domain-specific (data mesh) architectures across data in Google Cloud Storage and BigQuery.

Google BigQuery is officially Google’s data warehousing service, but this serverless offering can be configured for different use cases, including as a data lake and lakehouse. BigQuery’s serverless nature results in low maintenance and overall ease of use. BigQuery provides capabilities for querying large-scale datasets, native integration with Google’s in-house BI tools (Looker and the BigQuery BI Engine), and ML and predictive modeling capabilities through the built-in BigQuery ML feature as well as through integrations with Vertex AI and TensorFlow. Cross-cloud data sharing and analytics across Google Cloud, AWS, and Microsoft Azure are enabled through BigQuery Omni.

Google BigLake, now generally available, provides a unified interface and semantics to allow users to access data regardless of its storage format or where it physically resides. BigLake extends BigQuery to open file formats such as Parquet and ORC, on public cloud object stores. BigLake also recently extended support to open table formats by adding support for Apache Iceberg.

BigLake tables are an enhanced version of external tables that allow governance and access controls to be applied at the table, row, or column level. BigLake connectors to common open source query engines, including Spark, Trino, Presto, and Hive, allow these engines to query data in BigLake tables with consistent governance policies applied across engines. The result is that users can access data through the BigQuery interface without regard to where the data actually resides, whether in a native BigQuery table or as an object in cloud storage, and governance and access controls remain consistent as well.
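To illustrate that unified interface, the following is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical, and the table is assumed to be a BigLake (external) table defined over Parquet files in Cloud Storage.

```python
from google.cloud import bigquery

# Hypothetical project; sales_lake.orders is assumed to be a BigLake table.
client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT customer_id, SUM(order_total) AS lifetime_value
    FROM `my-analytics-project.sales_lake.orders`
    GROUP BY customer_id
    ORDER BY lifetime_value DESC
    LIMIT 10
"""

# The query is identical whether the table lives in native BigQuery storage or
# as objects in Cloud Storage; governance policies apply in either case.
for row in client.query(sql).result():
    print(row.customer_id, row.lifetime_value)
```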

Dataproc is a fully managed service designed to help organizations optimize and apply common open source tools, algorithms, and programming languages to large-scale datasets. It provides support for optimizing Apache Spark, Hadoop, Flink, Presto, and over 30 other open source tools and frameworks.

Dataproc can be deployed in serverless mode or on managed clusters on Google Compute Engine or Google Kubernetes Engine (GKE). The serverless mode eliminates the need to manage infrastructure, provisioning, and tuning. Alternatively, users can configure resizable clusters with various virtual machine types, disk sizes, numbers of nodes, and networking options. Clusters can be managed through a web interface, the Google Cloud SDK, RESTful APIs, and secure shell (SSH) access. An optional high-availability mode, with multiple master nodes, allows users to set jobs to restart on failure. Scheduled idle cluster deletion allows users to delete a cluster after a specified cluster idle period, at a specified future time, or after a specified time period has elapsed. The Dataproc Metastore provides metadata management, as well as security and access management capabilities through Google Identity and Access Management and Kerberos authentication.

The vendor highlights a use case of Dataproc as helping customers migrate existing on-premises Apache Hadoop and Spark clusters over to Dataproc for the scaling and elasticity benefits provided by the cloud.

Lastly, Dataplex is Google’s recently released solution that the vendor says enables organizations to discover, manage, monitor, and govern data. Dataplex supports multiple use cases and architectures including domain-oriented architectures (data meshes) or data fabric architectures. Users can logically organize all data in a particular domain as a set of Dataplex Assets within a data lake, without having to move data into a separate storage system. Assets can be grouped regardless of data type (structured and unstructured) and physical location (Google Cloud Storage or BigQuery)—purely based on business context. Dataplex provides metadata management and data access policies to allow administrators to manage governance and permissions across the domain. Dataplex integrates with Stackdriver for audit logs and data metrics.

Strengths: From a use case perspective, Google provides customers with solutions to enable many different use cases, including data lakes, data lakehouses, and data meshes, along with the traditional data warehouse. Its products’ serverless nature (or options) increase overall ease of use. Other strengths include the unified interface provided by BigQuery and BigLake, strong query federation capabilities, open source technology integration, and scalability and elasticity.

Challenges: Many extended functionalities of Google’s solutions derive from native integrations with other elements of Google’s suite of products (though the platform can query data on-premises and in other clouds), so this is not the best solution for companies with significant investment in other cloud platforms.

Hewlett Packard Enterprise (HPE)

HPE is the enterprise technology business that emerged from the 2015 split of the Hewlett-Packard Company. In 2019, HPE acquired MapR, one of the three original Hadoop companies, and has been working to integrate, optimize, and streamline that company’s offerings for enterprise customers. Software from HPE’s portfolio relevant to this report includes components of the HPE Ezmeral suite: specifically, the foundational HPE Ezmeral Data Fabric, from which data storage is managed and orchestrated; and the current HPE Ezmeral Ecosystem Packs, which allow query engines to be run on the data in the underlying storage.

Foundational to HPE’s data lake offerings is the HPE Ezmeral Data Fabric, a self-managed, scalable data storage offering that derives its powerful and far-reaching federation capabilities from the legacy of the MapR File System (MapRFS). At a high level, Ezmeral Data Fabric allows users to manage and organize data across clouds, on-premises, and at the edge into one logical view, which the vendor calls a global namespace. The global namespace allows users to create a logical hierarchy or grouping to federate data that can be found in multiple databases and in physically disparate locations: for example, in AWS, in Azure, and/or on-premises. This hierarchy thus presents the data in those disparate locations to the user in such a way that the user sees it all together in one array of storage.

At the base of this hierarchy, data can be stored in Volumes (instances of the MapR file system), in Buckets (S3 buckets or object storage), and/or in Topics (Kafka topics or other data streams). Volumes, buckets, and topics are then organized under a data fabric, which can be deployed in the AWS Cloud, Azure Cloud, or at the edge, with support for deployment on-premises on HPE’s GreenLake infrastructure forthcoming, as well as support for deployment in the Google Cloud. At the top of the hierarchy, above the data fabrics, is the global namespace. A user can create multiple data fabrics under one namespace, allowing them to federate and manage data in multicloud, hybrid cloud, and/or on-premises scenarios as well. Customers can also have multiple namespaces: for example, one namespace for a production environment and others for test and development environments.

As mentioned before, Ezmeral Data Fabric provides the foundation for HPE’s data lake offerings. At the time of the writing of this report, HPE provides the HPE Ezmeral Ecosystem packs for the actual querying of data. These ecosystem packs allow a number of open-source query engines to be downloaded, installed, and run against the customer’s data. Supported query engines in the ecosystem packs currently include Spark (SparkSQL), Drill, and Hive (on the Apache Tez framework). Combined with the powerful federation capabilities of the Data Fabric, HPE’s offerings allow its engines to reach data in very disparate locations, across clouds, the edge, and on-premises. An upcoming additional component of the HPE Ezmeral suite, Ezmeral Unified Analytics, will provide a fully-managed, cloud-based, containerized, software-as-a-service option for deploying these engines on data in the data lake.

Strengths: HPE Ezmeral Data Fabric is bolstered by the underlying capabilities of the MapR File System, which allow it to present a unified view of data across edge, cloud and on-premises environments and provide its query engines with far-reaching federated querying capabilities.

Challenges: Currently only the AWS and Azure clouds are supported by HPE Ezmeral Data Fabric; however, the vendor has noted that support for Google Cloud is forthcoming. At the time of the writing of this report, customers must download and manage the installation of open-source engines to run against data stored in the Data Fabric; however, the upcoming release of the Unified Analytics component of the Ezmeral suite should streamline this process greatly.

IBM

IBM’s Cloud Object Storage is the technology giant’s data lake storage offering. IBM’s Cloud Data Engine (formerly IBM Cloud SQL Query) is the vendor’s service that allows users to query data in IBM Cloud Object Storage and Kafka using SQL. IBM Db2 Big SQL is the vendor’s SQL-on-Hadoop tool that enables querying of HDFS, relational database management systems (RDBMSs), NoSQL databases, object stores, and WebHDFS through a single connection.

The IBM Cloud Data Engine is based on Apache Spark and provides a serverless SQL querying service for data stored in IBM Cloud Object Storage. In addition to querying, Data Engine also provides streaming data ingestion, data preparation, and ETL (extract, transform, load). Data Engine can directly query CSV, JSON, ORC, Parquet, or Avro files residing in Cloud Object Storage. Metadata is managed in a catalog compatible with the Hive metastore. The Data Engine UI provides a query editor and a REST API from which users can write and run queries and view their status, as well as automate queries. Data Engine also includes geospatial and time series data functionality, and it integrates with IBM Cloud Functions and IBM Watson Studio.

In case of any failures or outages, workloads can be rerouted to different regions by creating new instances of Cloud Object Storage in an available region, and frequent automated backups provide recovery points. Security features include IBM Identity and Access Management and IBM Key Protect, as well as granular access controls for Cloud Object storage buckets.

Data Engine can query data only in IBM Cloud Object Storage instances. To help users leverage its Cloud Object Storage, IBM offers a high-speed data transfer service based on its IBM Aspera technology, and the IBM Cloud Mass Data Migration service to help users transfer data from on-premises sources to the cloud.

IBM Db2 Big SQL is a massively parallel processing SQL-on-Hadoop query engine that offers federated querying across HDFS, IBM general parallel file system (GPFS), WebHDFS, relational databases with IBM Fluid Query, NoSQL databases, and object stores. It is available in two variations: integrated with the Cloudera Data Platform, or as a cloud-native containerized service on IBM Cloud Pak for Data.

The Db2 Big SQL console provides a unified control plane from which administrators can perform system monitoring and a query editor from which users can run SQL queries and view the status of queries run. The home page of the console provides dashboards presenting an overview of system health, including information on system availability, responsiveness, throughput, and resource usage. Admins can drill down from the dashboards to investigate anomalies. Db2 Big SQL integrates with data visualization tools including IBM Watson Studio Local, Tableau, and Cognos. It also integrates with Apache Ranger for authorization and auditing features.
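Because Db2 Big SQL presents a standard Db2 interface, a generic Db2 client can also issue federated queries against it. The following is a minimal sketch using the ibm_db driver; the hostname, port, credentials, and table are hypothetical.

```python
import ibm_db

# Hypothetical Big SQL head node and credentials; the connection string uses the
# standard Db2 driver keywords, since Big SQL speaks the Db2 wire protocol.
conn = ibm_db.connect(
    "DATABASE=bigsql;HOSTNAME=bigsql-head.example.com;PORT=32051;"
    "PROTOCOL=TCPIP;UID=analyst;PWD=secret;",
    "", "")

# The table here is assumed to be an external/Hadoop table surfaced through Big SQL.
stmt = ibm_db.exec_immediate(
    conn,
    "SELECT product_id, SUM(quantity) AS units FROM sales.orders GROUP BY product_id")

row = ibm_db.fetch_assoc(stmt)
while row:
    print(row)
    row = ibm_db.fetch_assoc(stmt)
```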

Strengths: IBM offers several options to users in this landscape. Strengths of IBM’s offerings include platform longevity and integration with open source technology; individual strengths of IBM Cloud Data Engine include scalability/elasticity, high availability, serverless operation and a unified control plane; and a strength of IBM Db2 Big SQL is federated querying.

Challenges: Data Engine can query data only in IBM Cloud Object Storage instances; however, IBM does offer data transfer services to assist users with moving their data to Cloud Object Storage to mitigate this limitation.

Microsoft

One of the core tenets of data lakes and lakehouses is that they use standard storage and file formats, allowing a number of engines to share and process data. That principle figures prominently in Microsoft’s case because its full data lake portfolio spans a number of products, including two components of Azure Synapse Analytics (Serverless SQL Pools and Spark Pools), Azure Databricks, and even many components of the older Azure HDInsight big data service. All of these services can process and query data stored in Azure Data Lake Storage and in Azure Blob Storage.

Readers should understand that when that technology runs on Azure, it is supported by Microsoft and integrates with numerous Azure cloud services. To appreciate Microsoft’s full data lake/lakehouse value proposition, readers should consider the capabilities of Synapse Analytics, Azure Databricks, and aspects of HDInsight, together. However, Synapse Spark pools, based on Apache Spark’s well-documented capabilities, and the capabilities of the Databricks platform are covered separately in this report.

In this review, we focus on Azure Synapse Analytics Serverless SQL, which we see as a highly strategic data lakehouse platform for Microsoft, functionally comparable to many other products reviewed in this report, but leveraging the T-SQL language and SQL Server query engine technology the company has been engineering and improving for decades.

Serverless SQL pools can query data in the lake via the creation of external tables, or dynamically via the OPENROWSET command. The serverless pools’ query engine, originally developed under the codename Polaris, is a specialized, distributed version of the SQL Server/Azure SQL engine. In addition to external tables, serverless pool databases can contain views, stored procedures, scalar functions, and table-valued functions. Although they contain no materialized tables, serverless pool databases otherwise function as if they were databases in SQL Server (or Azure SQL or Synapse dedicated pools). They are likewise compatible with the same external tools and APIs. This combination of capabilities makes serverless SQL pools a full-fledged data lakehouse platform.

Serverless SQL pools are built for large-scale data querying against an organization’s data lake. Queries can be run against the data stored in Azure Data Lake Storage, Azure Cosmos DB (via Azure Synapse Link for Cosmos DB), or Microsoft Dataverse. Supported file formats include Parquet, Delta Lake, CSV, and JSON. Users can also query external tables using Synapse Spark pools and Spark SQL.

As the name indicates, serverless SQL pools are serverless, so there is no infrastructure to set up or maintain. A default serverless SQL endpoint is provided within every Azure Synapse workspace, so users can begin querying data as soon as the workspace is created. Additional serverless pools can be created if desired, which can be handy for establishing workload management, security boundaries, or other compartmentalization constructs. Users can query the data from the native Synapse Studio SQL query editor or use dedicated drivers to query the data from an external query or BI tool or from custom applications. Serverless SQL pools can service client tools including Azure Data Studio (ADS), SQL Server Management Studio (SSMS), and Power BI, and connections with other tools are available through ODBC/JDBC, ADO.NET, and PHP drivers. Virtually any tool that can connect to Microsoft’s SQL Server database can connect to Synapse serverless SQL pools as well.
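To illustrate how a custom application might use those drivers, the following is a minimal sketch that connects to a serverless SQL endpoint with pyodbc and queries Parquet files in the lake via OPENROWSET. The workspace name, storage account, and credentials are hypothetical; serverless endpoints typically take the form <workspace>-ondemand.sql.azuresynapse.net.

```python
import pyodbc

# Hypothetical Synapse workspace endpoint and SQL credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=master;UID=analyst;PWD=secret;"
)

# OPENROWSET queries files in the lake directly; no table definition is required.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/lake/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""

for row in conn.cursor().execute(sql).fetchall():
    print(row)
```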

Security features include basic RBACs, and an integration with Azure Active Directory provides an interface for permissions management, single sign-on, and multifactor authentication. Administrators can grant, deny, and revoke permissions at the object level. Metadata management and governance features are also available through integration of a Microsoft Purview account with the user’s Synapse Analytics workspace.

In addition to serverless SQL pools, Synapse dedicated SQL pools and Synapse Spark pools can query Synapse’s data lakes.

Strengths: The strengths of this solution include the ease of maintenance and setup resulting from its serverless nature, its enterprise-ready SQL Server query engine technology, scalability and elasticity, high availability, and its unique approach to the lake versus warehouse question.

Challenges: This solution would best suit those organizations with existing Microsoft products and infrastructure because only data stored in Azure data stores can be queried, and a number of the capabilities are optimized through integration with other elements of Microsoft’s stack (for example, metadata management through Microsoft Purview and enhanced security and permissions management through Azure Active Directory).

Oracle

Like other technology giants in this report, Oracle offers a number of solutions and options within its technology stack to address use cases that are relevant to customers in the data lake/lakehouse space.

Oracle Cloud Infrastructure (OCI) Object Storage is Oracle’s cloud storage platform for unstructured data of any content type. Users can leverage OCI Data Integration, OCI GoldenGate, or OCI Streaming to ingest or transfer data into OCI Object Storage. The vendor promotes OCI Object Storage as having a security-first architecture, with features such as isolated network virtualization and a dedicated storage space unique to each customer, designed to improve security and reduce the risk of exposure. Oracle Cloud Guard provides continuous threat assessment, monitoring for anomalous events. OCI Identity and Access Management allows users to manage identities and grant different types of access to resources, and it integrates with SAML directory providers, including Azure Active Directory and Okta.

For querying, Oracle offers OCI Data Flow, a fully managed service for Apache Spark designed to perform processing on very large datasets with largely hands-free administration. REST APIs allow users to integrate with and create Apache Spark applications using SQL, Python, Java, Scala, or spark-submit. A dashboard provides a view of resource usage to assist in managing costs. Data Flow supports the Delta Lake format by default; no extra configuration is required.
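
To illustrate, the following is a minimal sketch (not an Oracle-provided sample) of a PySpark job that could be submitted to OCI Data Flow to read a Delta Lake table from OCI Object Storage; the bucket, namespace, and path are hypothetical placeholders.

```python
# Minimal sketch of a PySpark application for OCI Data Flow that reads a
# Delta Lake table from OCI Object Storage. Bucket, namespace, and path are
# hypothetical; Delta Lake support requires no extra configuration in Data Flow.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-lake-demo").getOrCreate()

# Object Storage locations are addressed with the oci:// URI scheme.
orders = spark.read.format("delta").load("oci://my-bucket@my-namespace/lake/orders")

# A simple aggregation over the lake data.
orders.groupBy("region").count().show()

spark.stop()
```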

Oracle Autonomous Database also includes features that enable SQL analytics across data lakes on Oracle Cloud Infrastructure as well as Amazon, Azure, and Google, in what the vendor promotes as its data lakehouse solution. Analytics capabilities include ML, graph, and spatial analysis, and any SQL-based BI tool or application can access the data regardless of where it is stored. Supported file types include JSON, CSV, Parquet, Avro, ORC, and Delta Lake.
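
As a minimal sketch of this access pattern, the snippet below queries lake data through Autonomous Database from a Python application; it assumes an external table named SALES_EXT has already been defined over Parquet files in object storage (for example, with the DBMS_CLOUD package), and the connection details are hypothetical placeholders.

```python
# Minimal sketch: querying an external table in Oracle Autonomous Database
# that is assumed to be defined over Parquet files in object storage.
# Credentials and connect string are hypothetical placeholders.
import oracledb

conn = oracledb.connect(user="analytics", password="<password>", dsn="mydb_high")
cur = conn.cursor()

# The external table is queried like any other table, even though the
# underlying Parquet files remain in the object store.
cur.execute("SELECT region, SUM(amount) FROM sales_ext GROUP BY region")
for region, total in cur:
    print(region, total)
```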

At the time of writing, Oracle MySQL HeatWave Lakehouse is in beta. According to the vendor, it will enable customers to query large data volumes in a variety of file formats, such as CSV and Parquet, as well as Aurora and Redshift backups. The scores in this review focus on products that are currently in general availability.

Strengths: Oracle’s strengths include its ease of use, platform longevity, scalability and elasticity, query federation, and open source technology integration.

Challenges: These solutions may appeal less to customers who are not already running Oracle solutions or MySQL databases, or who lack existing Oracle infrastructure.

Starburst

Founded in 2017, Starburst is a vendor whose platform aims to help companies optimize the Trino SQL query engine for their needs. Trino, formerly known as PrestoSQL, evolved from the original Presto project at Facebook and was rebranded as Trino in late 2020. Trino is a highly parallel open source distributed SQL query engine designed to perform analytics on large volumes of data.

Starburst’s platform includes two options: Starburst Galaxy, the vendor’s cloud-native, fully managed SaaS platform; and Starburst Enterprise, a self-managed solution, deployable in on-premises, hybrid, or cross-cloud scenarios, that provides an enhanced distribution of the open source Trino engine. The vendor says Starburst Galaxy is ideal for organizations that lack dedicated data engineers and platform administrators, because administrative tasks are handled for the customer and the interface is clean and easy to use. Starburst Enterprise requires admins and/or data engineers to create, configure, and maintain clusters, as well as set up the connectors to data sources, and thus may suit organizations with more advanced knowledge and skill sets.

Starburst Galaxy consists of a data consumption layer that sits between data in object storage, warehouses, or relational databases on one end and existing analytics tools on the other. It allows business users to run BI workloads, data engineers to build pipelines, and data scientists to run AI/ML workloads using the tools and languages of their choice, including Power BI, Tableau, Looker, ThoughtSpot, Superset, dbt, Metabase, Python, Jupyter, and R, as well as other tools connecting over standard ODBC/JDBC.

Starburst Galaxy clusters can be configured in t-shirt sizes, and clusters can be stopped, suspended, and restarted. Support for automatic cluster scaling is currently in public preview. Query federation capabilities allow Starburst Galaxy to run SQL queries across multiple data sources.

Starburst Enterprise includes more than 50 supported data connectors, including a combination of Starburst-exclusive and open source Trino connectors. These connectors include object storage, data warehouses, and relational database sources. Connectors allow users to query data from various sources through a single SQL interface. The source data itself is never actually stored within the platform, but once a connector has been registered, the data in the source immediately becomes available through an ODBC/JDBC connection, command line interface, or programmatic interface. A new feature called “Great Lakes connectivity” provides Iceberg, Delta Lake, Hive, and Hudi as options for table formats.
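
The sketch below shows how a federated query might be issued from Python through the open source Trino client once a cluster and two connectors (an object storage catalog and a PostgreSQL catalog) have been configured; the host, catalogs, schemas, and table names are hypothetical.

```python
# Minimal sketch of a federated query via the open source Trino Python client.
# The cluster host, catalogs, schemas, and tables are hypothetical and assumed
# to have been configured in Starburst ahead of time.
from trino.dbapi import connect

conn = connect(
    host="starburst.example.com",
    port=443,
    user="analyst",
    http_scheme="https",
)
cur = conn.cursor()

# One SQL statement joins data lake files with an operational database.
cur.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM lake.sales.orders AS o
    JOIN postgres.public.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")

for row in cur.fetchall():
    print(row)
```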

Starburst offers a query acceleration feature called “Warp Speed,” a byproduct of its acquisition of Varada. The feature both indexes and caches portions of the lake and, the vendor says, can lower query execution time on the lake by between 2x and 7x. It is generally available for Starburst Enterprise only; it is in private preview for Starburst Galaxy.

Both Starburst Galaxy and Starburst Enterprise provide built-in RBACs for data sources and audit logs of actions performed through this functionality. In Starburst Enterprise, RBACs can alternatively be provided through integrations with Privacera, Immuta, or Apache Ranger. Starburst Enterprise also includes query logging capabilities, from which administrators can access details of query history, including single-query statistics and query plans, and its Enterprise UI provides a graphical interface for viewing metrics and information about cluster performance, query history, query statistics, and more.

To support the lakehouse paradigm, the MERGE statement, recently introduced into the Trino engine, compares a source table with a target table on a key field and then applies the resulting inserts, updates, and deletes to the target, which the vendor says provides users with even more data warehouse-like capabilities on the data lake.
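
As a minimal illustration of the statement’s shape, the hedged sketch below issues a MERGE through the Trino Python client against hypothetical catalog, schema, and table names; it assumes the target table uses a format that supports row-level changes, such as Iceberg or Delta Lake.

```python
# Minimal sketch: a Trino MERGE statement issued from Python. Catalog, schema,
# and table names are hypothetical; the target is assumed to use a table format
# that supports row-level modification (for example, Iceberg or Delta Lake).
from trino.dbapi import connect

cur = connect(
    host="starburst.example.com", port=443, user="engineer", http_scheme="https"
).cursor()

cur.execute("""
    MERGE INTO lake.sales.customers AS t
    USING lake.staging.customer_updates AS s
      ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
      UPDATE SET email = s.email, region = s.region
    WHEN NOT MATCHED THEN
      INSERT (customer_id, email, region)
      VALUES (s.customer_id, s.email, s.region)
""")
```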

Starburst’s Galaxy and Enterprise offerings may at first appear to be standalone federated query engines, but the combination of MPP operation, optimized query planning, multicluster operation, auto-scaling, MERGE support, connectivity to data lakes with data in Parquet, Iceberg, Delta Lake, and Hudi formats, and compatibility with all major BI tools makes Starburst a true lakehouse platform.

Strengths: Strengths of Starburst include query acceleration, federated querying, scalability and elasticity, high availability, support for multiple file and table formats, both managed and self-managed offerings, a multicluster structure, a unified control plane, and the robust documentation and community support provided by the open source Trino community as well as by the vendor itself.

Challenges: Although Starburst Galaxy is the vendor’s managed services offering, cluster and data source connections do need to be configured by the user at least on a basic level, so a certain amount of expertise is still required. However, the Starburst and Trino community provides robust documentation and a wide network of support, so there are many resources available to assist.

6. Analyst’s Take

While data lakes have been around since the early days of big data, existing technologies are steadily improving and new capabilities and paradigms continue to be established. The data lake and data warehouse battle continues to intensify, as the quest for and development of the ultimate data storage, analytics, and management solution is never complete.

All the vendors in the landscape provide robust solutions, and all have their differentiating characteristics and areas in which they excel. This report is intended to inform you about the products available and help you decide on the evaluating criteria and metrics that are most relevant, so you can use them to select the offering that best matches your organization’s needs.

Key takeaways include:

  • Among the data lake query engines there are different approaches: some are based on optimizing an underlying open source query engine; some possess their own proprietary engine; some make use of data virtualization; some function as an “orchestration” layer coordinating multiple applications’ access to multiple storage systems; and some present serverless options.
  • The vendor landscape for data lakes is currently in its third generation, marked by sophisticated query engines, support for open file and table formats, and the rise of the lakehouse moniker. More and more vendors are marketing their platforms as lakehouses, although there is no clear-cut delineation between data lake and lakehouse. Overall, data lake technology has emerged in this generation as refined and enterprise-ready.
  • Some vendors include query acceleration within their engines, to assist with improving the speed and execution of queries. In some cases, the engine generates multiple query plans and selects the best option from among them, and/or modifies query planning at runtime for optimal speed and performance. In some cases, the engine breaks up query plans into smaller units called tasks, run on worker nodes, that operate on a specific partition of the data. Some vendors also include additional structural optimizations within the engine, including caching of frequently accessed data to assist with query speed and precomputations of aggregations or other operations that the engine can then use to satisfy all or part of a query instead of needing to process the raw data at query time.
  • Among the different offerings, two general approaches can be seen: some vendors, such as Oracle, IBM, Databricks, and all three major cloud providers, provide their own data lake storage in addition to a query engine. Others, such as Ahana, Starburst, and Dremio provide a query engine and allow customers to connect to data in the source of their choosing. Customers should consider factors such as how such offerings would harmonize with their existing infrastructure when evaluating solutions.
  • There has been a rise in vendors introducing support for open table formats, including Apache Iceberg, Apache Hudi, and Delta Lake, as they seek to extend the functionality of their data lake offerings and bring data warehouse-like analytics functionality to the data lake.
  • Some vendors have begun to embrace the concept of the “data mesh,” a paradigm shift away from the centralized models of both data warehouses and data lakes. These vendors are introducing features that support domain-specific architectures. This includes allowing data to be organized by logical domain rather than physical domain, and embracing the concept of data products—self-contained units of data produced under the ownership of a team from start to finish and designed to be presented to and used by a data consumer.
  • When choosing the best product for an organization’s needs, it is very important to make a selection based not on the technology itself but rather on the specific use case the product needs to address, because that alignment will be the best predictor of future success.

7. About Andrew Brust

Andrew Brust has held developer, CTO, analyst, research director, and market strategist positions at organizations ranging from the City of New York and Cap Gemini to GigaOm and Datameer. He has worked with small, medium, and Fortune 1000 clients in numerous industries and with software companies ranging from small ISVs to large clients like Microsoft. The understanding of technology and the way customers use it that resulted from this experience makes his market and product analyses relevant, credible, and empathetic.

Andrew has tracked the Big Data and Analytics industry since its inception, as GigaOm’s Research Director and as ZDNet’s original blogger for Big Data and Analytics. Andrew co-chairs Visual Studio Live!, one of the nation’s longest-running developer conferences, and currently covers data and analytics for The New Stack and VentureBeat. As a seasoned technical author and speaker in the database field, Andrew understands today’s market in the context of its extensive enterprise underpinnings.

8. About GigaOm

GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.

GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.

GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.

9. Copyright

© Knowingly, Inc. 2023 "GigaOm Radar for Data Lakes and Lakehouses" is a trademark of Knowingly, Inc. For permission to reproduce this report, please contact sales@gigaom.com.