Big data analytics platforms load, store, and analyze volumes of data at high speed, providing timely insights to businesses. Data-driven organizations leverage this data, for example, for advanced analysis to market new promotions, operational analytics to drive efficiency, or for predictive analytics to evaluate credit risk and detect fraud. Customers are leveraging a mix of relational analytical databases and data warehouses to gain analytic insights.
This report focuses on relational analytical databases in the cloud, because deployments are at an all-time high and poised to expand dramatically. The cloud enables enterprises to differentiate and innovate with these database systems at a much more rapid pace than was ever possible before. The cloud is a disruptive technology, offering elastic scalability vis-à-vis on-premises deployments, enabling faster server deployment and application development, and allowing less costly storage. For these reasons and others, many companies have leveraged the cloud to maintain or gain momentum as a company.
This report outlines the results from the GigaOm Analytic Field Test based on an industry standard TPC Benchmark™ H (TPC-H)1 to compare Amazon Redshift, Azure SQL Data Warehouse, Google Big Query, and Snowflake Data Warehouse—four relational analytical databases based on scale-out cloud data warehouses and columnar-based database architectures. Despite these similarities, there are some distinct differences in the four platforms.
Amazon Web Services Redshift was the first managed data warehouse cloud service and continues to get a high level of mindshare in this category. It indeed ticks all the table stakes boxes for a cloud analytic database.
In terms of storage, Amazon Redshift has a tight coupling between compute nodes and storage; data is stored on the compute nodes with a backup on AWS S3. While tight coupling keeps the data in close proximity to compute resources, it creates difficulty when scaling. Changing the instance size requires data movement to repartition and redistribute data on the new topology.
Amazon Redshift Spectrum can create an external table to store raw data on S3. You would not want to architect a steady diet of this form of data virtualization, but it is great to have it available for edge queries. Data virtualization has become very interesting for analyzing disparate data sets dispersed across multiple clouds, on-premises, and edge devices. . Redshift Spectrum has some future-proofing that a modern data engineering approach might utilize.
In support of diverse data, Amazon Redshift has a few nice features like the JSON_EXTRACT_PATH_TEXT function for noSQL key value databases. However, key value nesting is limited and does not support large text columns like BLOBs, so it cannot store raw data in a Variable Character field (VARCHAR) larger than 64K.
Amazon Redshift is a fit for organizations needing a data warehouse with little to no administrative overhead and a clear, consistent pricing model. Amazon Web Services supports most of the databases in this report, and then some. Redshift is not the only analytic database on AWS, although sometimes this gets convoluted.
Azure SQL Data Warehouse storage is separate from the compute Data Warehouse Unit (DWU). This enables Azure SQL Data Warehouse to scale columnar storage capacity and compute resources independently. This capability adjusts to various workload demands, offering potential cost savings when demand is low. Azure SQL Data Warehouse can pause and resume compute billing, where only storage is billed during the paused time. Azure SQL Data Warehouse achieves good balance in both configurability and simplicity, in a way that is both easy to administer and flexible in handling almost any usage pattern.
Azure SQL Data Warehouse is fully ANSI-SQL compliant and users familiar with SQL Server will be very comfortable using this environment.
Azure SQL Data Warehouse can export data to a local file the same way an on-premises SQL Server can, e.g., via the SQL Server Import and Export Wizard.
Although concurrency was not tested in the benchmark, Azure SQL Data Warehouse supports 128 concurrent queries. This is many more than BigQuery, which supports a maximum concurrency of 50 per project. Snowflake’s maximum concurrency is difficult to calculate because it is a function of the number of queries, the submitted queries’ execution plan, the size of the warehouse, and the maximum number of multi-cluster setting. In our experience we saw an X-Large (16 node) Snowflake warehouse run 6 concurrent simple scan queries (SELECT with a single column filter WHERE clause) before starting to queue. Thus, if we set the maximum multi-clusters at 5, we would likely hit a max concurrency of 30 in that scenario. Your results may vary.
Google BigQuery is a managed cloud data warehouse service with some interesting distinctions. Google abstracts the details of the underlying hardware, database, and all configurations.
BigQuery is a hands-off database without indexes or column constraints. Defragmentation and system tuning are not required. It is truly serverless. Google Cloud manages the servers in a fully hands-off manner to the customer, dynamically allocating storage and compute resources. The customer does not define nodes and capacity of the BigQuery instance. The provisioning of compute is particularly fast and seamless.
You pay for the amount of data you query and store. Customers can pre-purchase flat-rate computation “slots” or units in increments of $10,000 per month per 500 compute units.
We simply consumed the results for this field test, but should we have been looking to do more with the data, such as exporting in different formats, BigQuery has capabilities to do so.
Also although we did not time data ingest, this is an area of strength for BigQuery. BigQuery ingest does not impact the performance of queries because separate compute slots are used for loading apart from the ones performing the SELECT statements for users. Ingest stats are auto-generated and data with an unknown schema can be loaded using schema auto-suggestion. Also, there is no charge for ingest.
As a cloud-only, fully managed solution, Snowflake has a clear separation between compute and storage. For Snowflake on AWS, which is what we used for the queries, data is stored in AWS S3 and is cached when queries are executed to bring the data in closer proximity to compute resources. Snowflake essentially offers two configuration “levers” — the size of the warehouse cluster and how many clusters are permitted to spin up to handle concurrency. Snowflake scales by cluster server count in powers of 2 (i.e., 1, 2, 4, 8, 16, and so on.) If enabled, Snowflake will spin up additional clusters to handle multi-user concurrent query workloads. If not enabled, it will place paused queries in a queue until resources free up.
While simple and mostly abstracted for the user, this lack of configurability is a tradeoff and, in our experience, performance is impacted. In our estimation, Snowflake performance hinges greatly on cache. If Snowflake must access Amazon S3 for additional data, query execution slows dramatically. Historically, Snowflake has not disclosed the server class or size of its AWS EC2 instances. Thus, one cannot calculate beforehand what configuration would be most optimal for workloads — it is an exercise in trial and error.
Additionally, Snowflake’s multi-cluster option is quite expensive as costs multiply rapidly when multiple clusters are employed. For example, if a customer is paying the $4 per hour rate for full support and security, an X-Large 16-node cluster normally costs $64 per hour to run. If users hit Snowflake with a high volume of concurrent query requests (for example, at the end of month), Snowflake will spin up an additional four clusters to handle the workload. This would spike the cost to $320 per hour. Snowflake would automatically spin the additional clusters down once demand has passed, which helps, but predicting and budgeting cost for performance can be tricky with Snowflake.
Snowflake supports an ANSI-compliant form of SQL. However, we experienced syntax differences when converting existing queries to Snowflake.
Snowflake only supports data export to S3 (on AWS). However, it is possible to use the Snowflake command line interface results to direct data to a file in a Linux shell.
|Azure SQL Data Warehouse||Snowflake Data Warehouse||Google BigQuery||Amazon Redshift|
|First Released||1989 (SQL Server)
2005 (Analytics Platform System)
2016 (Azure SQL Data Warehouse)
|Current Version||12.0.2000.8||3.0.5||October 11, 2018 release||1.0.5463|
|SQL||Transact-SQL||Snowflake SQL||DML||PostgreSQL 8|
|Scale Out Cloud Data Warehouse||✓||✓|
The time to reach all worldwide regions for this release of Microsoft Azure SQL DW is January-February 2019.
The benchmark equivalency was established by price, selecting the lowest, no support options across the four platforms. DW30000C costs $362.40/hour to run. 4X-Large Snowflake @ $3.00/node costs $384/hour. In the Price-Performance calculation below, we used the lowest, no support $2.00/node price for Snowflake. The parameter values for the queries used across all vendors are informed by the TPC Benchmark™ H (TPC-H)2 spec validation queries. This is not an official TPC benchmark. The queries were executed using the following setup, environment, standards, and configurations.
To evaluate performance, we used a derivation of the industry-standard TPC-H benchmark, which we call GigaOm Analytic Field Test. This workload can be described as follows:
i. the schema and data are used from TPC-H
ii. The queries used for the benchmark tests were compliant with the standards set out by the TPC-H specification and included only minor query modifications as set out by section 2.2.3 of the TPC-H specification document. For example, minor query modifications included vendor-specific syntax for date expressions. Also in the specification, queries 2, 3, 10, 18 and 21 require row limits and, thus, vendor specific syntax was used (e.g., TOP, FIRST, LIMIT, and so forth) as allowed by section 18.104.22.168 of the TPC-H specification.
iii. no throughput runs (multi-stream) runs are performed
iv. three power runs (single-stream) are performed, using the query parameters defined for the TPC-H qualifications queries
v. The primary metric used was the aggregate total of the best execution times for each query. Three power runs were completed. Each of the 22 queries was executed three times in order (1, 2, 3,…) against each vendor cloud platform, and the fastest of the three times was used as the performance metric. These best times were then added together to gain the total aggregate execution time for the entire workload.
vi. The price-performance metric is dollars per query-hour ($/query-hour). This is defined as the normalized cost of running the TPC-H workload on each of the cloud platforms. It was calculated by multiplying the best on-demand rate (expressed in dollars) offered by the cloud platform vendor (at the time of testing) times the number of computation nodes used in the cluster and by dividing this amount by the aggregate total of the best execution times for each query (expressed in hours).
The data sets used in the benchmark were a workload derived from the well-recognized industry standard TPC Benchmark™ H (TPC-H)3.
From tpc.org: “The TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.”
To show the data model, the following diagram was taken from page 13 of the TPC-H Revision 2.17.3 specification document.
To give an idea of the data volumes used in our benchmark, the following table gives row counts of the database when loaded with 30TB of TPC-H data:
|TPC-H Table||30TB Row Count|
Our benchmark included ten (10) different cluster environments:
Schema was aligned to be the same for both products as shown in the table below. Nation, Region and Supplier tables were distributed evenly (SQL DW DISTRIBUTION ROUND_ROBIN and Redshift DISTSTYLE ALL), all other tables were distributed using a hash key on the same distribution columns(SQL DW DISTRIBUTION HASH and Redshift DISTRIBUTION KEY).
|Table||Distribution Column||SQL DW Distribution||Redshift Distribution|
Text data files were generated using TPC-H data generation DBGen tool. The files were stored on Azure Blob Storage and copied to Amazon S3. The same files were used to load Azure SQL DATA WAREHOUSE with Polybase CTAS command, and Redshift using COPY command from their respective cloud data stores. After the data files were loaded, it was confirmed that the row counts were identical.
We sought to replicate the TPC-H Benchmark queries modified only by syntax differences required by the platforms. The benchmark is a fair representation of enterprise query needs. The TPC-H testing suite has 22 queries, which are described by the table below. The SQL substitution parameters used in each query were TPC-H specification validation queries and are noted.
|1||Pricing Summary Report||✓||✓||✓|
|2||Minimum Cost Supplier||✓||✓||✓|
|4||Order Priority Checking||✓||✓|
|5||Local Supplier Volume||✓|
|6||Forecasting Revenue Change||✓|
|8||National Market Share||✓||✓|
|9||Product Type Profit Measure||✓||✓|
|10||Returned Item Reporting||✓||✓|
|11||Important Stock Identification||✓||✓|
|12||Shipping Modes and Order Priority||✓|
|17||Small Quantity Order Revenue||✓||✓||✓|
|18||Large Volume Customer||✓||✓||✓|
|20||Potential Part Promotion||✓||✓|
|21||Suppliers Who Kept Orders Waiting||✓||✓||✓|
|22||Global Sales Opportunity||✓||✓||✓||✓|
This section analyzes the query results from the fastest runs of the three sets of 22 GigaOm Analytic Field Test queries described in Table 3.
Query 1 is the only query that uses only Sum, Average, and Count. All Azure SQL Data Warehouse configurations were faster than Google and each configuration was faster than the analogous Snowflake and Redshift configurations.
Query 2 was one of two queries that contained a Min/Max. Google had some challenge with the query. Each Azure SQL Data Warehouse configuration was faster than the analogous Snowflake and Redshift configurations.
Google and the 10N for Redshift had some challenge with the query and each Azure SQL Data Warehouse configuration was faster, twice by 3X+, than the analogous Snowflake configuration.
In Query 4, with a subquery and a count, we see the familiar pattern of relative performance except Google performance is closer to the Snowflake 2X Large performance.
Query 5, which only employs a sum aggregation, favored Azure SQL Data Warehouse as well.
The simple SUM of Query 6 was a very high relative performer for Azure SQL Data Warehouse.
Again, we see the familiar pattern of 2X+ performance between Azure SQL Data Warehouse and the analogous Snowflake configuration, improvement to the Redshift configurations and Google at a different level.
Query 8 performance was similar to Query 7.
The Snowflake 2X-Large outperformed Azure SQL Data Warehouse 5000C in Query 9.
Query 10 uniquely has a Sum and a Top/Limit. Google, with a runtime of 186 seconds, outperformed the Snowflake 2X-Large. Azure SQL Data Warehouse was the top performer.
Query 11 (another sub-select and a sum operation) saw performance slightly favoring Redshift.
Query 12, another simple SUM, was another plus performance query for Azure SQL Data Warehouse, with Azure SQL Data Warehouse outperforming analogous Snowflake configurations by 5, 14 and 10 times, Redshift by appx. 2, 4 and 3 times and Google by 4 times (5000 C).
Query 13 is the only GigaOm Analytic Field Test query with an explicit JOIN. Google outperformed the low-end configurations of Azure SQL Data Warehouse, Redshift and Snowflake.
For Query 14, Azure SQL Data Warehouse took only an impressive 7 and 6 seconds on the 15000C and 30000C configurations.
Another query with very fast performance from Azure SQL Data Warehouse, Redshift being competitive and Google being more competitive.
For Query 16, The Snowflake 2X-Large outperformed Azure SQL DW 5000C, as did all of the Redshift configurations.
Query 17 yielded performance advantages for Azure SQL Data Warehouse.
It is worth pointing out that Snowflake 3X-Large and 4X-Large outperform the DW 5000C, but the 15000C and 30000C outperform all like configurations.
Query 19, a SUM, was another strong performer for Azure SQL Data Warehouse, with even the 5000C outperforming all except the Snowflake and Redshift high specifications.
Query 20 showed Azure SQL Data Warehouse to be 3X+ as fast or faster in analogous configurations to Snowflake, faster than analogous configurations to Redshift and, worst case, 3X faster than Google.
Query 21 shows the familiar pattern of high performance differences between Azure SQL Data Warehouse and the analogous Snowflake configuration and Google at a different level.
Snowflake outperformed Azure SQL Data Warehouse in all 3 analogous configurations for Query 22, as did Redshift.
Azure SQL Data Warehouse outperformed the analogous configuration of Snowflake in 62 of the 66 (94%) best times of the GigaOm Analytic Field Test query runs. Azure SQL Data Warehouse outperformed the analogous configuration of Redshift in 54 of the 66 (82%) best times of the GigaOm Analytic Field Test query runs. Azure SQL Data Warehouse also outperformed Google BigQuery in all 100% of their respective GigaOm Analytic Field Test best times.
As previously stated, Google BigQuery charges by the byte processed not by the hour, like most cloud analytics vendors. Even though performance does not inherently affect how much you pay, we still made the price for performance calculations for BigQuery in order to understand how they compared to Azure SQL Data Warehouse and Snowflake.
During our benchmark testing, we experienced a high variance in execution times with BigQuery. See the included chart of the standard deviation of execution times in seconds.
As you can see, Azure SQL Data Warehouse and Snowflake’s performance was relatively consistent on the same queries from run to run. However, BigQuery had very wide ranges. (We chose best run instead of the average among the competitors to give BigQuery the benefit of the doubt.) For example, BigQuery’s best time was 42 minutes running GigaOm Analytic Field Test query #9. However, it’s worst time was 1 hour and 8 minutes. Twenty-six minutes is atypical for a query execution range.
One possible hypothesis as to why BigQuery performance was so erratic could be the shared, multi-tenancy architecture. Most cloud vendors allow you to choose (or only offer) dedicated hardware for your platform. BigQuery does not offer dedicated hardware. Noisy neighbors can impact performance. To illustrate, see the following chart of the slot (or compute unit) utilization for BigQuery during our benchmark runs. Please note that it would be very difficult to map slots utilized with actual queries. We also are not certain how noisy our same-rack neighbors were during the runs, but we speculate it caused the high variance we experienced.
To measure price-performance we used the TPC-H metric of cost of ownership divided by composite query per hour. If you contiguously ran all 22 of these queries to completion of the set, the cost at an hourly basis is indicated in the chart below. Azure SQL Data Warehouse provided consistently best price-performance in our queries.
For Azure SQL Data Warehouse, Redshift and Snowflake, you pay for compute resources as a function of time. With Google BigQuery, you pay for bytes processed. During a single run of the GigaOm Analytic Field Test suite, we processed roughly 113TB of data at $5 per TB for BigQuery. For the others, we took the total execution time elapsed for all 22 queries and extrapolated the hourly rate we were paying for each cluster class.
In a direct comparison of Azure SQL Data Warehouse and Snowflake, Figure 5 aligns the following cluster classes:
|Azure SQL Data Warehouse||Snowflake|
|DW5000C||10 nodes||→||2X-Large||32 nodes|
|DW15000C||30 nodes||→||3X-Large||64 nodes|
|DW30000C||60 nodes||→||4X-Large||128 nodes|
Charting the comparison, Azure SQL Data Warehouse ran the GigaOm Analytic Field Test queries roughly 3 times cheaper than Snowflake in terms of dollars per query per hour across all three clusters. See Figure 6.
Since Google BigQuery does not have a scale factor, we could only compare the different scaled SQL Data Warehouse classes against BigQuery, whose scale varies automatically with slot (compute unit) allocation chosen by the Google engine.
|Azure SQL Data Warehouse||Google BigQuery|
Charting the comparison shows Azure SQL Data Warehouse ran the GigaOm Analytic Field Test queries between 12 and 17 times more cost-effectively than Google BigQuery in terms of price per query per hour across all three clusters. See Figure 8.
Charting the comparison shows Azure SQL Data Warehouse ran the GigaOm Analytic Field Test queries between 8% and 31% more cost-effectively than Amazon Redshift in terms of price per query per hour across all three clusters. See Figure 9.
Cloud databases are a way for enterprises to avoid large capital expenditures, provision quickly, and provide performance at scale for advanced analytic queries. Relational databases with analytic capabilities continue to support the advanced analytic workloads of the organization with performance, scale, and concurrency. In a representative set of corporate-complex queries from the GigaOm Analytic Field Test, Azure SQL Data Warehouse consistently outperformed the competition.
Overall, the benchmark results were insightful in revealing query execution performance and some of the differentiators for Azure SQL Data Warehouse, Snowflake Data Warehouse, Amazon Redshift, and Google Big Query. Azure SQL Data Warehouse query response times on the 30TB GigaOm Analytic Field Test data set were overall seven times faster than Snowflake across all cluster classes, approximately twice the performance of Redshift across all cluster classes and 14 times faster than BigQuery (compared to the SQL DW15000C).
In terms of price per performance, Azure SQL Data Warehouse ran the GigaOm Analytic Field Test queries roughly 3 times cheaper than Snowflake in terms of cost per query per hour, across three examined cluster classes and 8-31% more cost-effective in terms of cost per query hour across the examined cluster classes for Redshift. Azure SQL Data Warehouse also ran the GigaOm Analytic Field Test queries between 12 and 17 times cheaper than Google BigQuery in terms of price per query per hour across all three SQL Data Warehouse clusters.
Price and performance are critical points of interest when it comes to selecting an analytics platform, because they ultimately impact total cost of ownership, value, and user satisfaction. Our analysis reveals Azure SQL Data Warehouse to be very powerful and comparative in value.
Microsoft offers Azure SQL Data Warehouse…
William McKnight is a former Fortune 50 technology executive and database engineer. An Ernst & Young Entrepreneur of the Year finalist and frequent best practices judge, he helps enterprise clients with action plans, architectures, strategies, and technology tools to manage information.
Currently, William is an analyst for GigaOm Research who takes corporate information and turns it into a bottom-line-enhancing asset. He has worked with Dong Energy, France Telecom, Pfizer, Samba Bank, ScotiaBank, Teva Pharmaceuticals, and Verizon, among many others. William focuses on delivering business value and solving business problems utilizing proven approaches in information management.
Jake Dolezal is a contributing analyst at GigaOm. He has two decades of experience in the information management field, with expertise in analytics, data warehousing, master data management, data governance, business intelligence, statistics, data modeling and integration, and visualization. Jake has solved technical problems across a broad range of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and restaurants. He has a doctorate in information management from Syracuse University.
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.