Table of Contents
- Summary
- ELT Myths
- ELT Reality
- The Future of ELT
- About the Sponsor
- About Andrew Brust
- About GigaOm
- Copyright
1. Summary
Since the early days of data warehousing, the concept of Extract, Transform, and Load (ETL) has been well-known and familiar. In those pioneering times, enterprise organizations extracted data from their packaged systems – like those for accounting, sales, and Customer Relationship Management (CRM) – as well as from some custom, in-house line-of-business applications; transformed its structure so that the data from these separate systems could be correlated and conformed; and then loaded that cleaned, coordinated data into the warehouse. Oftentimes, data from perhaps half a dozen systems was integrated this way, and it worked pretty well.
What has been gaining in favor and popularity, though, is a different approach to pre-processing data for the warehouse, whereby data is loaded before it is transformed, changing the sequence to Extract-Load-Transform and the corresponding acronym to ELT. However, ELT is more than an anagram of ETL, and it is not just an arbitrary resequencing of the same steps. It is a fundamentally different approach to pre-processing data, in terms not only of architecture but also of philosophy. In this paper, we will explore what ELT is all about, in terms that people familiar with ETL can appreciate and understand.
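To make the resequencing concrete, consider the following minimal sketch. SQLite stands in here for a cloud data warehouse, and the hard-coded rows, table names, and column names are purely illustrative; the point is the order of operations, with the transform expressed as SQL that runs inside the warehouse after the raw data has been loaded.

```python
# A minimal ELT sketch: SQLite stands in for the cloud warehouse, and the
# hard-coded rows stand in for data extracted from a source system.
import sqlite3

# -- Extract: pull raw records from a source system (simulated here).
raw_orders = [
    ("A-1001", "2023-06-01", "149.00", "usd"),
    ("A-1002", "2023-06-02", "89.50", "USD"),
]

warehouse = sqlite3.connect(":memory:")

# -- Load: land the data as-is, untransformed, into a staging table.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT, currency TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# -- Transform: use the warehouse's own SQL engine to clean and conform,
# producing an analysis-ready table from the raw staging table.
warehouse.execute(
    """
    CREATE TABLE orders AS
    SELECT order_id,
           DATE(order_date)     AS order_date,
           CAST(amount AS REAL) AS amount,
           UPPER(currency)      AS currency
    FROM raw_orders
    """
)

print(warehouse.execute("SELECT * FROM orders").fetchall())
```

Note that the extract and load steps touch the data as little as possible; all of the cleansing logic lives in the final SQL step, where the warehouse can execute it.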
Figure 1: Reasons for ELT
Sources Increase
While in the formative days of data warehousing the number of source systems was often in the mid-single digits, that is far from the case now. Today, data sources are everywhere, and even single-purpose Web applications serve up valuable data. Meanwhile, the edict that organizations should be data-driven is increasingly common. Even systems that expose Application Programming Interfaces (APIs), rather than tabular data and SQL interfaces, are viable sources, because an increasing number of data integration platforms can consume those APIs and convert their output into query-able data (a pattern sketched below). The confluence of these circumstances means that as much data, from as many sources as possible, needs to be integrated into analyses.
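The following sketch shows that API-to-table pattern in miniature. The JSON payload is a hypothetical stand-in for what a REST endpoint might return (a real pipeline would fetch it with an HTTP client), and the field and table names are invented for illustration:

```python
# Sketch of turning API output into query-able rows. The JSON payload below
# is a hypothetical stand-in for a REST endpoint's response.
import json
import sqlite3

api_response = json.loads(
    '{"customers": ['
    '{"id": 7, "name": "Acme", "plan": {"tier": "gold"}},'
    '{"id": 8, "name": "Globex", "plan": {"tier": "silver"}}]}'
)

# Flatten the nested JSON into rows a SQL engine can query.
rows = [
    (c["id"], c["name"], c["plan"]["tier"])
    for c in api_response["customers"]
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, plan_tier TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
print(db.execute("SELECT name FROM customers WHERE plan_tier = 'gold'").fetchall())
```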
Ultimately, this plethora of data makes it tricky for standalone ETL systems to synthesize all of it in monolithic transformation jobs that run on expensive, difficult-to-scale infrastructure and attempt to work with all the data in memory. Instead, it makes more sense to leverage the power of the data warehouse platform itself and do the transformation work at scale. Doing so requires loading the unprocessed data into the warehouse before transforming it.
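As a rough illustration of what that pushdown looks like, the sketch below conforms two raw staging tables with a set-based SQL join that the warehouse engine executes where the data already lives, rather than pulling both datasets into an ETL server's memory. Again, SQLite stands in for the warehouse, and the tables and columns are hypothetical:

```python
# Sketch of pushing the transformation down to the warehouse engine:
# the conforming join runs as set-based SQL inside the warehouse, not
# row-by-row in an external ETL tool's memory.
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript(
    """
    CREATE TABLE raw_crm_accounts (account_id TEXT, account_name TEXT);
    INSERT INTO raw_crm_accounts VALUES ('A1', 'Acme'), ('A2', 'Globex');

    CREATE TABLE raw_billing (account_id TEXT, invoiced TEXT);
    INSERT INTO raw_billing VALUES ('A1', '1200.00'), ('A2', '450.75');

    -- The transform step: correlate and conform the raw tables in SQL,
    -- executed entirely by the warehouse's scalable query engine.
    CREATE TABLE account_revenue AS
    SELECT a.account_id,
           a.account_name,
           CAST(b.invoiced AS REAL) AS invoiced
    FROM raw_crm_accounts AS a
    JOIN raw_billing AS b USING (account_id);
    """
)
print(wh.execute("SELECT * FROM account_revenue").fetchall())
```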
Cloud as the Cause; Cloud as the Solution
In today’s world of cloud data warehouses especially, the economical, scalable, and elastic nature of data warehouse compute and storage makes the warehouse an ideal platform for data transformation work. So while ELT may seem like a retrofit to some, it actually plays to the strengths of data warehouse platforms. ELT is not a workaround bound to cause technical debt; it is an enlightened paradigm shift chosen to avoid it. Along the way, this pivot in approach does a better job of isolating load operations from transformation operations, improving data pipeline rigor overall. This move can increase the reliability of load and transform executions and, for reasons we will explore in this paper, reduce coding errors as well.