- What the Process Looks Like
- Wrapping Up
- About Andrew Brust
Today’s data analytics stack is a paradox, representing both fundamental change and sustained tradition. On the one hand, we have significant new paradigms, including the data lake, and a number of new technologies too, including Spark and Hadoop. On the other hand, we are still working with BI tools, we are again using Structured Query Language (SQL), and not only is the data warehouse still with us, but it has taken the lead in modern analytics.
Really, though, the change is pervasive. Even the stalwart technologies and principles that we’ve conserved have been affected by broader shifts in the technology world, and the cloud, not surprisingly, is chief among them. Virtually the entire analytics stack has moved to the cloud. That is true for both storage and compute, the scalability and elasticity of which are most responsible for the adoption of the data lake and the resurgence of the data warehouse. And almost dwarfing those phenomena, cloud-based machine learning (ML) has taken off.
Each of these cloud data components is important on its own. But data preparation is, in many ways, the connective tissue that makes them work together. It is data prep that makes the data-driven whole greater than the sum of its lake, warehouse, and ML parts. That may sound like hyperbole, but in fact, it is an understatement. Data preparation moves data within the cloud. It moves data between clouds. And data preparation moves data to the cloud as each organization adopts it.
With an increasing number of organizations now using the cloud for storage and analysis of raw unstructured data, as well as for the design, training, and retraining of machine learning models, doing data prep in the cloud has become critical. Data prep is the vehicle for graduating unstructured data in the data lake to structured data in the warehouse, delivering a platform for reporting and analytics. In machine learning workflows, meanwhile, data prep is the key to transforming and streamlining data sets down to the relevant, cleansed columns needed for use in models, and it is one of the best approaches to feature engineering.
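To make the idea concrete, here is a minimal sketch of the kind of refinement described above: raw, loosely structured records are parsed, cleansed, and reduced to a small set of typed columns. The sample records, the column names (`id`, `amount`, `country`), and the `prepare` function are all hypothetical and chosen only for illustration; real data prep tools work at far larger scale and with much richer rules.

```python
import json

# Hypothetical raw records, as they might sit in a data lake
# (one JSON document per line, with inconsistent quality).
RAW_LINES = [
    '{"id": "1", "amount": " 19.99 ", "country": "us", "notes": "gift"}',
    '{"id": "2", "amount": "n/a", "country": "DE"}',
    'not valid json',
    '{"id": "3", "amount": "5", "country": " fr ", "debug": true}',
]


def prepare(lines):
    """Parse raw lines and keep only cleansed, typed columns."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop lines that cannot be parsed at all
        if not isinstance(record, dict):
            continue  # drop JSON values that are not objects
        try:
            row = {
                # Project onto the columns assumed relevant downstream
                # and coerce each one to a consistent type.
                "id": int(record["id"]),
                "amount": float(str(record["amount"]).strip()),
                "country": str(record["country"]).strip().upper(),
            }
        except (KeyError, ValueError, TypeError):
            continue  # drop records with missing or malformed fields
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in prepare(RAW_LINES):
        print(row)
```

In this sketch, unparseable lines and records with unusable values are simply dropped, and extra fields such as `notes` are discarded. Whether to drop, repair, or flag such records is a policy decision that depends on the workload.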
All of this combines to make refinement of data in the cloud a critical task, if not the critical task, in the enablement of modern analytics. As such, cloud data preparation has major ramifications for the way people, processes, and technologies manifest and function.