Evaluating your ETL / ELT Capabilities, or Data Migration and Transformation Needs? Register for this free ON-DEMAND GigaOm Webinar, ” A Cloud Maturity Model: Migration, Movement, and Transformation”
In data warehousing, the decades-old concept of Extract, Transform, and Load (ETL) is well-known and familiar. Enterprise organizations use ETL to extract data from their packaged systems and some custom, in-house line-of-business applications; transform the structure so that the data from these separate systems can be correlated and conformed; and then load that neatened, coordinated data into the warehouse. Oftentimes, data from half a dozen systems can be integrated this way, and it works pretty well.
A new approach to pre-processing data for the warehouse has been gaining in favor and popularity, however. Extract-Load-Transform (ELT) modifies the sequence to load data before it is transformed. However, it is more than a simple arbitrary resequencing of the same steps. It is a fundamentally different approach to pre-processing data, in terms of both architecture and philosophy.
Unfortunately, misconceptions around ELT have sprung up, and these myths can discourage its adoption. Here, we tackle the two biggest myths around ELT and explore why they are wrong and why your organization should consider ELT if it hasn’t already.
Myth #1: ELT Is Just a Gimmicky Pivot on ETL
As a general statement, ELT is not just a novel exercise to show that changing the order of operations (i.e. transforming data after loading it, rather than before) yields an equivalent result. Instead, the ELT approach acknowledges that ETL platforms, which often run on a single server, take on an undue computing burden as the number of data sources and volume of data both increase.
In the “old days” of loading data from maybe half a dozen systems, at a frequency of once per day (or less), the burden was reasonable, and running it on ETL infrastructure took that load off the warehouse itself. This division made sense… then. In the present environment, however, data sources have increased by orders of magnitude, and load frequencies have increased dramatically – in some cases running almost continuously. This change means that the ETL infrastructure that formerly reduced load and contention on the warehouse can now become a point of failure in its continuous operation.
Furthermore, ELT systems can manage load logic natively, taking on scheduling, monitoring, and exception handling without requiring dedicated coding, and eliminating the range of errors such coding can introduce. Further, because the jobs leverage the computing power and MPP architecture of the corporate data warehouse (CDW), they run faster and provide the greater concurrency necessary to accommodate the increase in data sources, volumes, and load frequency. Transformation jobs, meanwhile, run on the warehouse itself and can take advantage of its (often much) greater scalability. This approach conforms much more closely to the principle of using the right platform for the right job.
Far from being a simple rearrangement of a process, ELT is a transformation of it. It frees up computing power, creates efficiencies in time and power use, and allows infrastructure to handle greater load.
Myth #2: ELT Implies a Schema-on-Read Approach
Identifying and untangling this myth involves some appreciation of nuance and clearly defining our terms. When we entered the era of Big Data (which, after all, is one of ELT’s catalysts), we also began endorsing a new proposition of working with analytic data, dubbed “schema-on-read.”
This approach, which works best for ad hoc analysis, involves deferring transformation until analysis time, rather than performing it in advance. With schema-on-read, data loading takes place on its own, just as it does with ELT. But while schema-on-read and ELT share that overlap, the two are not the same thing. And the distinction is a non-trivial one, especially in the case of the data warehouse.
Schema-on-read can work very well in data lake environments, where ad hoc analysis that explores “unknown unknowns” takes place. In such circumstances, it makes sense to defer the imposition of schema, because the context of the analysis is variable.
But the data warehouse scenario is different and, by its production nature, typically disqualifies schema-on-read.
While not invalidating that approach, the data warehouse model asserts that for certain analyses, especially those that execute repeatedly (and thus require optimized performance), data must be transformed in advance of analysis. This schema-on-write approach makes the data more consumable for drill-down analysis, avoids executing the same transformations repeatedly, and makes explicit the idea that formal schema is desirable for operational use cases.
As it turns out, ELT does not rule out schema-on-write at all; in fact, it accommodates it quite well. With ELT, data transformation still happens and can fit right into the schema-on-write pattern. Once the load step has completed, transformation can kick off in earnest. When it does, it executes as a dedicated process, using the engine underlying the data warehouse. ELT can also leverage the data warehouse’s native language, SQL, for its ability to effect data transformation declaratively, rather than requiring execution loops containing multiple imperative instructions, to get the job done.
Because most cloud data warehouses leverage a Massively Parallel Processing (MPP) architecture, transformation jobs running on them can execute efficiently, using the divide-and-conquer approach MPP uses to scale performance. And because many cloud data warehouses use columnar storage that allows large volumes of data to be placed in memory, ELT does not lose any of the memory-based performance that many ETL platforms support.
As a bonus, in cases where customers do prefer a data-lake-like approach using schema-on-read, ELT can accommodate it. The key takeaway, however, is that it does not require it. In short, just because ELT does not enforce imposing schema when data is first loaded, does not mean it precludes schema-on-write.