I recently wrote about the importance of data pipelines and the role they play in transporting data between the stages of data processing and analytics. Healthy data pipelines are necessary to ensure data is integrated and processed in the sequence required to generate business intelligence. The concept of the data pipeline is nothing new of course, but it is becoming increasingly important as organizations adapt data management processes to be more data-driven.
Data pipelines are used to perform data integration. They have traditionally involved batch extract, transform and load processes in which data is extracted from its source and transformed, before being loaded into a target database (typically a data warehouse). Data-driven processes require more agile, continuous data processing, with an increased focus on extract, load and transform processes – as well as change data capture and automation and orchestration – as part of a DataOps approach to data management.
The need for more agile data pipelines is driven by the need for real-time data processing. Almost a quarter (22%) of respondents to Ventana Research’s Analytics and Data Benchmark Research are currently analyzing data in real time, with an additional 10% analyzing data every hour. More frequent data analysis requires data to be integrated, cleansed, enriched, transformed and processed for analysis in a continuous and agile process. Traditional batch extract, transform and load data pipelines are ill-suited to continuous and agile processes. These pipelines were designed to extract data from a source (typically a database supporting an operational application), transform it in a dedicated staging area, and then load it into a target environment (typically a data warehouse or data lake) for analysis.
ETL pipelines can be automated and orchestrated to reduce manual intervention. However, since they are designed for a specific data transformation task, ETL pipelines are rigid and difficult to adapt. As data and business requirements change, ETL pipelines need to be rewritten accordingly.
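To make the batch ETL pattern concrete, here is a minimal sketch using SQLite in-memory databases as stand-ins for an operational source and a data warehouse target. The table names, columns and cents-to-dollars transformation are all hypothetical, chosen only to illustrate the extract, transform and load sequence:

```python
import sqlite3

# Hypothetical source: an operational database holding raw order rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 1250), (2, 399), (3, 10000)])

# Extract: pull the rows out of the source.
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: convert cents to dollars in a dedicated staging step,
# outside both the source and the target.
staged = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: bulk-insert the already-transformed rows into the target table.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_orders (id INTEGER, amount_dollars REAL)")
target.executemany("INSERT INTO fact_orders VALUES (?, ?)", staged)
```

The rigidity the paragraph above describes is visible here: the transformation is baked into the pipeline code itself, so a change to the source schema or to the required output means rewriting and redeploying the staging step.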
The need for greater agility and flexibility to meet the demands of real-time data processing is one reason we have seen increased interest in extract, load and transform data pipelines. ELT pipelines involve the use of a more lightweight staging tier, which is required simply to extract data from the source and load it into the target data platform. Rather than a separate transformation stage prior to loading, ELT pipelines make use of pushdown optimization, leveraging the data processing functionality and processing power of the target data platform to transform the data.
Pushing data transformation execution to the target data platform results in a more agile data extraction and loading phase, which is more adaptable to changing data sources. This approach is well aligned with the schema-on-read approach applied in data lake environments, as opposed to the schema-on-write approach in which schema is applied as data is loaded into a data warehouse. Since the data is not transformed before being loaded into the target data platform, data sources can change and evolve without delaying data loading. This potentially enables data analysts to transform data to meet their requirements rather than have dedicated data integration professionals perform the task. As such, many ELT offerings are positioned for use by data analysts and developers, rather than IT professionals. This can also reduce delays in deploying business intelligence projects by avoiding the need to wait for data transformation specialists to (re)configure pipelines in response to evolving business intelligence requirements and new data sources.
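The same hypothetical orders example can be restated as ELT. Here a single SQLite in-memory database stands in for the target data platform: the raw rows are loaded untransformed, and the transformation is expressed as SQL that the target engine executes itself, which is the essence of pushdown optimization:

```python
import sqlite3

# Hypothetical target platform standing in for a cloud data warehouse.
target = sqlite3.connect(":memory:")

# Extract + Load: raw source rows land in the target untransformed.
target.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
target.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                   [(1, 1250), (2, 399), (3, 10000)])

# Transform: pushed down to the target engine as SQL. Because raw_orders
# is already loaded, an analyst can reshape it at any time, or re-run the
# transformation with new logic, without touching the extraction step.
target.execute("""
    CREATE TABLE fact_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
""")
```

Note how the schema-on-read flexibility arises: if the source later adds a column, loading into `raw_orders` (suitably widened) continues uninterrupted, and only the SQL transformation needs to evolve.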
Like ETL pipelines, ELT pipelines may also be batch processes. Both can be accelerated by using change data capture techniques. Change data capture is similarly not new but has come into greater focus given the increasing need for real-time data processing. As the name suggests, CDC is the process of capturing data changes. Specifically, in the context of data pipelines, CDC identifies and tracks changes to tables in the source database as rows are inserted, updated or deleted. CDC reduces complexity and increases agility by only synchronizing changed data, rather than the entire dataset. The data changes can be synchronized incrementally or in a continuous stream.
CDC can be used to optimize both ETL and ELT processes. In combination with ETL, CDC can be used to stream data changes into the ETL staging area to be transformed and then loaded in bulk into the target data platform. In combination with ELT, CDC can be used to stream data changes into the target data platform for transformation, increasing the agility of the overall data pipeline. We assert that by 2024, six in 10 organizations will adopt data-engineering processes that span data integration, transformation and preparation, producing repeatable data pipelines that create more agile information architectures.
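The core idea behind CDC can be sketched as follows. Production CDC tools typically read the source database's transaction log; this simplified snapshot-diff version, using plain dictionaries keyed by a hypothetical primary key, only illustrates the concept of deriving and applying inserts, updates and deletes rather than resynchronizing the whole dataset:

```python
# Previous and current state of a source table, keyed by primary key.
# Keys and status values are illustrative only.
previous = {1: "shipped", 2: "pending", 3: "pending"}
current = {1: "shipped", 2: "delivered", 4: "pending"}

# Identify the changes: new keys are inserts, changed values are updates,
# missing keys are deletes.
changes = []
for key, value in current.items():
    if key not in previous:
        changes.append(("insert", key, value))
    elif previous[key] != value:
        changes.append(("update", key, value))
for key in previous:
    if key not in current:
        changes.append(("delete", key, None))

# Synchronize the target by applying only the changed rows.
target = dict(previous)
for op, key, value in changes:
    if op == "delete":
        target.pop(key)
    else:
        target[key] = value
```

Here only three change records flow through the pipeline instead of the full table, which is precisely the reduction in data movement that makes CDC attractive for near-real-time synchronization.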
Both ETL and ELT pipelines can be automated and orchestrated to provide further agility by reducing the need for manual intervention. Specifically, the batch extraction of data can be scheduled to occur at regular intervals of a set number of minutes or hours, while the various stages in a data pipeline can also be managed as orchestrated workflows using data engineering workflow management platforms. As previously mentioned, data observability also has a key role to play in monitoring the health of data pipelines and associated workflows as well as the quality of the data itself.
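Orchestration amounts to declaring the stages of a pipeline and their dependencies, then letting a workflow manager run them in a valid order. The following minimal sketch uses Python's standard-library `graphlib` in place of a full workflow management platform; the stage names and dependencies are hypothetical, mirroring an ELT pipeline:

```python
from graphlib import TopologicalSorter

# Record the order in which stages actually run.
log = []
stages = {
    "extract": lambda: log.append("extract"),
    "load": lambda: log.append("load"),
    "transform": lambda: log.append("transform"),
}

# ELT-style dependencies: each stage maps to the stages it depends on.
deps = {"load": {"extract"}, "transform": {"load"}}

# Run the stages in dependency (topological) order, as an orchestrator would.
for stage in TopologicalSorter(deps).static_order():
    stages[stage]()
```

A real workflow manager adds scheduling, retries and monitoring on top of this dependency resolution, but the underlying model, a directed graph of tasks executed in topological order, is the same.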
Not all data pipelines are suitable for ELT and CDC, and the need for batch ETL pipelines remains. However, ELT and CDC approaches have a role to play, alongside automation and orchestration, in increasing data agility, and I recommend that all organizations consider the potential advantages of more agile data pipelines in driving business intelligence and business transformation.