-- TCO & Benchmark

# Cloud Data Warehouse Performance Testing

## 1. Summary

Big data analytics platforms load, store, and analyze volumes of data at high speed, providing timely insights to businesses. Data-driven organizations leverage this data, for example, for advanced analysis to market new promotions, operational analytics to drive efficiency, or for predictive analytics to evaluate credit risk and detect fraud. Customers are leveraging a mix of relational analytical databases and data warehouses to gain analytic insights.

This report focuses on relational analytical databases in the public cloud because deployments are at an all-time high and poised to expand dramatically. The cloud enables enterprises to differentiate and innovate with these database systems at a much more rapid pace than was ever possible before. The cloud is a disruptive technology, offering elastic scalability vis-à-vis on-premises deployments, enabling faster service deployment and application development, and allowing less costly storage. For these reasons and others, many companies have leveraged the cloud to maintain or gain momentum.

This report outlines the results from an analytic performance test derived from the industry-standard TPC Benchmark™ DS (TPC-DS) to compare Cloudera Data Warehouse service (CDW)—part of the broader Cloudera Data Platform (CDP)—with four prominent competitors: Amazon Redshift, Azure Synapse Analytics, Google BigQuery, and Snowflake. Overall, the test results were insightful in revealing query execution performance of these platforms.

In terms of price per performance, Cloudera ran the Field Test 20% more cost-effectively than the nearest competitor, Amazon Redshift, 40% more cost-effectively than Azure Synapse, and 80% more cost-effectively than Snowflake. Cloudera ran the Field Test 5.5 times more cost-effectively than Google BigQuery.

Introduction

Performance is important but is only one criterion for a data warehouse platform selection. This is only one point-in-time check into specific performance. There are numerous other factors to consider in selection across factors of administration, integration, workload management, user interface, scalability, vendor, reliability, and numerous other criteria. It is also our experience that performance changes over time and is competitively different for different workloads. Also, a performance leader can hit up against the point of diminishing returns and viable contenders can quickly close the gap.

GigaOm runs all of its performance tests to strict ethical standards. The results of the report are the objective results of the application of queries to the simulations described in the report. The report clearly defines the selected criteria and process used to establish the field test. The report also clearly states the data set sizes, the platforms, the queries used, and more. The reader can determine how to qualify the information for individual needs. The report does not make any claim regarding third-party certification and presents the objective results received from the application of the process to the criteria as described in the report. The report strictly measures performance and does not purport to evaluate other factors that potential customers may find relevant when making a purchase decision.

This is a sponsored report. Cloudera chose the competitors, the test, and the Cloudera cluster size. The default configurations were chosen. GigaOm set up the environments and ran the queries. Choosing compatible configurations is subject to judgment. We have attempted to describe our decisions in this paper.

In this writeup, all the information necessary is included to replicate this test. You are encouraged to compile your own representative queries, data sets, data sizes, and compatible configurations and test for yourself.

## 2.Cloud Analytics Platform Offerings

This report outlines the results from an analytic performance test derived from the industry standard TPC Benchmark™ DS (TPC-DS) to compare Cloudera Data Warehouse (running Impala-based virtual warehouses), Amazon Redshift, Azure Synapse Analytics, Google BigQuery, and Snowflake—five relational analytical databases based on scale-out cloud data warehouses and columnar-based database architectures. Despite these similarities, there are some distinct differences among the platforms.

Cloudera Data Warehouse (part of Cloudera Data Platform)
Cloudera Data Warehouse (CDW) was the only platform offering we tested that boasts flexibility through support for both data center and multiple public cloud deployments, as well as capabilities across analytical, operational, data lake, data science, security, and governance needs.

CDW is part of CDP, a secure and governed cloud service platform that offers a broad set of enterprise data cloud services with the key data functionality for the modern enterprise. CDP was designed to address multi-faceted needs by offering multi-function data management and analytics to solve an enterprise’s most pressing data and analytic challenges in a streamlined fashion.

The architecture and deployment of CDP begins with the Management Console, where several important tasks are performed. First, the preferred cloud environment (for example, AWS or Azure) is set up. Second, data warehouse clusters and machine learning (ML) workspaces are launched. Third, additional services, such as Data Catalog, Workload Experience Manager, and Replication Manager are utilized, if required.

The Cloudera Data Warehouse service provides self-service independent virtual warehouses running on top of the data kept in a cloud object store, such as S3. The virtual warehouses use Data Catalog as a logical collection of metadata to define the managed data and its business context. Multiple virtual warehouses can share a single Data Catalog. The advantages of this virtual warehouse architecture include isolation and automatic configuration. Virtual warehouses and their compute resources are isolated to prevent “noisy neighbors” or resource hog queries bogging down a conventional monolithic data warehouse. Virtual warehouses also have automated management and performance scaling features such as auto-scaling, auto-suspend, and auto-resume. These features simplify capacity planning, adjust compute capacity to workload requirements, and ensure that you only pay for what you need to run your queries.

Amazon Redshift
Amazon Redshift was the first managed data warehouse cloud service and continues to hold a high level of mindshare in this category. It indeed ticks all the table stakes boxes for a cloud analytic database, and Redshift is a fit for organizations that need a data warehouse with little to no administrative overhead and a clear, consistent pricing model. Redshift (when run without Spectrum) is different from the other competitors in this report in that it doesn’t read directly from cloud object storage during query processing (a separate loading step is required).

We tested the latest RA3 Redshift engine, which introduced a new managed storage layer that is an upgrade from the more tightly coupled storage on the older DS2 and DC2 instance types. For Redshift, we paid an hourly rate for when the cluster was running, but it also provides a pause feature to stop billing as needed. Even with managed storage, Redshift, as noted just above, doesn’t read directly from object storage (such as S3) during query processing. Data must instead be explicitly loaded from S3 into the managed storage, and the time required for that loading step was not included in the benchmark results.

Azure Synapse Analytics
On Azure Synapse Analytics, formerly known as Azure SQL Data Warehouse, storage is separate from the compute Data Warehouse Unit (DWU). This enables Azure Synapse to scale columnar storage capacity and compute resources independently, making it possible to adjust to changing workload demands and offering potential cost savings when demand is low. Synapse can pause and resume compute billing, where only storage is billed during the paused time. Synapse achieves a good balance in both configurability and simplicity, making it both easy to administer and flexible in handling almost any usage pattern.

With Synapse, you can scale the compute DWU on the fly. We paid an hourly rate for when our cluster was active, but there is also a separate data storage charge for the SQL database underneath the Synapse engine.

Google BigQuery is a managed service with some interesting distinctions. Google abstracts the details of the underlying hardware, database, and all configurations, and has designed BigQuery as a hands-off database without indexes or column constraints. Defragmentation and system tuning are not required. A serverless system, Google Cloud manages servers in a fully hands-off manner to the customer, dynamically allocating storage and compute resources. The customer does not define nodes or the capacity of the BigQuery instance. Compute provisioning is particularly fast and seamless.

Billing is based on the amount of data you query and store. At the time of this writing, customers can pre-purchase flat-rate computation “slots” or units in increments of $10,000 per month per 500 compute units. However, Google recently introduced Flex Slots, which allow slot reservations as short as one minute and billed by the hour. There is a separate charge for active storage of data. Snowflake A cloud-only, fully managed solution that can run on AWS, Azure, or Google Cloud Platform, Snowflake offers a clear separation between compute and storage. For Snowflake on AWS, which is what we used for the queries in our benchmark, data is stored in AWS S3 and cached when queries are executed to bring the data in closer proximity to compute resources. Snowflake essentially offers two configuration “levers” — the size of the warehouse cluster and the number of clusters permitted to spin up to handle concurrency. Snowflake scales cluster server count by powers of 2 (i.e., 1, 2, 4, 8, 16, and so on). If enabled, Snowflake spins up additional clusters to handle multi-user concurrent query workloads, and automatically spins down the additional clusters once demand has passed. If concurrency is not enabled, Snowflake will place paused queries in a queue until resources become available. For Snowflake, you pay a flat hourly fee for when compute resources are being used. We paid$3.00 per hour for the Standard+ tier. Once the compute warehouse goes inactive, you no longer pay. However, there is a separate charge for data storage.

## 3.Test Setup

The data sets used in the test were a workload derived from the well-recognized industry standard TPC Benchmark™ DS (TPC-DS).

From tpc.org: “The TPCDS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general-purpose decision support system.… The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark.”

The parameter values for the queries used across all vendors are from the TPC Benchmark™ DS (TPC-DS) 2.13 spec validation queries. This is not an official TPC benchmark. The queries were executed using the following setup, environment, standards, and configurations.

The data model consists of 24 tables—7 fact tables and 17 dimensions. To give an idea of the data volumes used in our field test, the following table gives row counts of fact tables in the database when loaded with 30TB of data:

Table 1. Database Row Count

### Queries

The testing suite has 99 queries—4 of which have two parts (14, 23, 24, and 39). This brings a total of 103 queries. The queries used for the tests were compliant with the standards set out by the TPC Benchmark™ DS (TPC-DS) specification and included only minor query modifications as set out by section 4.2.3 of the TPC-DS specification document. For example, minor query modifications included vendor-specific syntax for date expressions. Also in the specification, some queries require row limits and, thus, vendor specific syntax was used (for example TOP, FIRST, LIMIT, and so forth) as allowed by section 4.2.4 of the TPC-DS specification.

### Cluster Environments

Our benchmark included five different cluster environments, which are shown here. The cluster sizes were chosen to achieve similar hourly costs for each vendor, to the extent possible.

Table 2. Platform Summary

## 4.Test Results

This section analyzes the query results from the execution of the test queries (derived from the TPC-DS) described above. The primary metric used was the best aggregate total of each of the three runs. Three power runs were completed on each platform. Each of the 99 queries was executed three times in order (1, 2, 3) against each vendor cloud platform.

Amounts shown in Table 3 represent the cost in USD for running each individual query. Red represents the most costly and green (often faint) is the least costly (best). The “Cl” column is Cloudera, “BQ” is BigQuery, “RS” is Redshift, “SF” is Snowflake, and “Sy” is Synapse.

Table 3. Price-Performance Field Test Results for all Queries

## 5.Price-Performance

To conduct comparative performance testing, we typically make the attempt to align the hardware and software as much as possible between the platforms. However, achieving a same like-for-like configuration with fully-managed cloud data warehouse platforms is very difficult. Thus, we aligned as closely as we could on price-per-hour as a basis for parity. System cost can be a difficult aspect to compare among systems, because vendor platforms vary in their pricing and licensing models. However, all platforms present consistent, on-demand hourly cloud pricing that we can use to determine price per performance.

Cloudera has a clear pricing model. However, there are both software costs (for the use of Cloudera Data Warehouse) and infrastructure costs (in our case, Amazon Web Services EC2 instances). For the Cloudera software usage of the platform, there is a $0.72 per executor node (64 total) per hour. On the infrastructure side, there were 64 r5d.4xlarge executor nodes plus two coordinator nodes of the same type—for a total of 66 at$1.152 per hour in US East (at the time of our testing). There are also three m5.2xlarge management nodes that serve as orchestrators across three availability zones which cost $0.384 per hour. Altogether, it cost$123.26 per hour to run tests on a Cloudera Virtual Warehouse.

For Amazon Redshift, we simply paid a set dollar amount per hour by the instance class and node count that we configured. For example, considering the 38-node ra3.4xlarge configuration in the US East region (with rates at the time of the testing) we used for the 30TB test, we paid $3.26 per hour with 38 nodes, so$123.88 per hour.

Azure Synapse Analytics charges $0.0151 per Data Warehouse Unit (DWU) per hour. Since we used a DW7500c with 7,500 units, we paid$113.25 in the East US region.

Snowflake charges $3.00 per node, and with 64 nodes engaged yields a per-hour cost to$192.

For Google BigQuery, with its 3,000 slots employed and $0.04 cost per slot per hour, the per-hour cost calculates out to$120.

With the hourly cost of the configuration, to calculate the price-per-performance, we used the following formula:

Elapsed time of test (seconds) x Cost of platform (\$/hour)
3,600 (seconds/hour)

The elapsed time of the test is actually the duration of the fastest run of all 99 queries. The following graph details the overall price-performance for each platform.

Figure 1. Price Performance

## 6.Conclusion

Cloud data warehouses are a way for enterprises to achieve advanced analytics while avoiding large capital expenditures, provisioning quickly, and providing performance at scale for advanced analytic queries. Relational databases with analytic capabilities continue to support the advanced analytic workloads of the organization with performance, scale, and concurrency. In a representative set of corporate-complex queries from the well-known TPC-DS standard, Cloudera consistently performed equally, if not better than, the competition, and it proved to be the best value in terms of price per performance.

Overall, the test results were insightful in revealing query execution performance and some of the differentiators for Cloudera Data Warehouse, Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake cloud data warehouses.

In terms of price per performance, Cloudera proved 20% less costly per unit of performance in the Field Test than the nearest competitor, Amazon Redshift, 40% less costly than Azure Synapse, and 80% less costly than Snowflake. Cloudera proved to be 5.5 times less costly than Google BigQuery in the Field Test.

Price and performance are critical points of interest when it comes to selecting an analytics platform, because they ultimately impact total cost of ownership, value, and user satisfaction. Our analysis reveals Cloudera to be very powerful and comparative in value.

## 7.Disclaimer

Performance is an important criterion for data warehouse platform selection, but it is only one factor among many to consider in the overall procurement process. Other significant factors to consider in selection include: administration, integration, workload management, user interface, scalability, vendor, reliability, and numerous others. It is our experience that performance as a selection criterion is prone to change over time and can produce different conclusions based on varying workloads. A performance leader can likewise run up against the point of diminishing returns, while viable contenders are able to close the gap quickly. All of which is to say, performance testing is a useful tool in selecting a data warehouse platform, but it should be considered in the context of the overall solution and its value.

GigaOm runs all of its performance tests to strict ethical standards. The results of the report are the objective results of the application of queries to the simulations described in the report. The report clearly defines the selected criteria and process used to establish the field test. It also clearly states the data set sizes, the platforms, the queries, and other elements used. The reader is left to determine for themselves how to qualify the information for their individual needs. The report does not make any claim regarding third-party certification and presents the objective results received from the application of the process to the criteria as described in the report. The report strictly measures performance and does not purport to evaluate other factors that potential customers may find relevant when making a purchase decision.

This is a sponsored report. Cloudera chose the competitors, the test, and the Cloudera configuration. GigaOm chose the most compatible configurations for the other tested platforms and ran the queries. Choosing compatible configurations is subject to judgment. Cloudera can be deployed on AWS, Azure, and Google Cloud, as well as on-premises. In this instance Cloudera was tested on AWS.

In this writeup, all the information necessary to replicate this test is included. You are encouraged to compile your own representative queries, data sets, data sizes, and compatible configurations and test for yourself.

At Cloudera, we believe that data can make what is impossible today, possible tomorrow. We empower people to transform complex data into clear and actionable insights. Cloudera delivers an enterprise data cloud for any data, anywhere, from the Edge to AI. Powered by the relentless innovation of the open source community, Cloudera advances digital transformation for the world’s largest enterprises.

William McKnight

An Ernst & Young Entrepreneur of the Year Finalist and frequent best practices judge, William is a former Fortune 50 technology executive and database engineer. He provides Enterprise clients with action plans, architectures, strategies, and technology tool selection to manage information.

William McKnight is an Analyst for GigaOm Research who takes corporate information and turns it into a bottom-line producing asset. He’s worked with companies like Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals and Verizon — Many of the Global 2000 — and many others. William focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in information management.

He is a frequent international keynote speaker and trainer. William has taught at Santa Clara University, UC-Berkeley and UC-Santa Cruz.

Jake Dolezal

As a contributing Analyst at GigaOm, Jake Dolezal has two decades of experience in the Information Management field with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has experience across a broad array of industries, including: healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.