Data lakes have enormous potential as a source of business intelligence. However, many early adopters of data lakes have found that simply storing large amounts of data in a data lake environment is not enough to generate business intelligence from that data. Similarly, lakes and reservoirs have enormous potential as sources of energy. However, simply storing large amounts of water in a lake is not enough to generate energy from that water. A hydroelectric power station is required to harness and unleash the power-generating potential of a lake or reservoir, utilizing a combination of turbines, generators and transformers to convert the energy of the flowing water into electricity. A hydroanalytic data platform, the data equivalent of a hydroelectric power station, is required to harness and unleash the intelligence-generating potential of a data lake.
The term data lake was first used in 2010 by James Dixon of Pentaho to describe a single environment (at that point primarily based on Apache Hadoop) where data could be stored and processed for analysis by multiple users for multiple purposes. Now predominantly based on cloud object storage, data lakes have become an important part of the analytics data estate for many companies. Utilizing cloud object storage as the underlying repository makes the data lake a relatively inexpensive way of storing large volumes of data from multiple enterprise applications and workloads, especially semi-structured and unstructured data that is unsuitable for storing and processing in a data warehouse. Where the data lake concept struggled to deliver on its potential during its first decade, however, was in enabling easy analysis of data for multiple purposes.
Early data lake projects lacked structured data management and processing functionality, such as support for table formats, metadata management, and transactional updates and deletes, as well as query engine and data orchestration functionality. These structured data processing and analytics acceleration capabilities are the equivalent of the turbines, generators and transformers in a hydroelectric power station. They need to be assembled and deployed as a hydroanalytic data platform to turn a data lake into an environment capable of supporting multiple business intelligence projects as well as data science and even operational applications.
Structured data processing and analytics acceleration capabilities are all established in data warehousing, and the fact that they were missing from early data lakes means that data warehouses have continued to be deployed by organizations to serve established business intelligence workloads, often alongside data lakes. Our research shows that more than two-thirds (67%) of organizations deploy both data lakes and data warehouses, with many feeding data between the two environments. The hydroanalytic data platform concept describes the convergence of data warehouse technologies and big data (data lake) platforms. This convergence has been ongoing for several years and is a topic that Ventana Research has previously addressed on multiple occasions. I assert that through 2024, data warehouse, data lake and data streaming technologies will converge to create analytic data platforms enabling organizations to collect and analyze all types of operations-generated information.
Just as there are multiple types of hydroelectric power stations — including dams, pumped storage and run-of-the-river power stations — so, too, can hydroanalytic data platforms take multiple forms. The two primary approaches to combining data warehouse and data lake functionality are deploying a data warehouse on or alongside the data lake, and integrating data warehousing functionality into the data lake.
Although stand-alone data warehouse offerings remain available in the cloud, the integration of cloud data warehouse environments with cloud storage-based data lakes has become common. The combination takes advantage of the ability to independently scale the compute and storage layers. It utilizes the data lake simply for low-cost storage, while the data warehouse provides governance of data in the data lake (both structured and unstructured). The data warehouse also provides the ability to persist curated subsets of structured data, apply predetermined schema and take advantage of established data warehousing functionality for high-performance and high-concurrency query requirements.
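As a rough illustration of persisting a curated subset under a predetermined schema, the following Python sketch promotes only conforming records from raw lake objects into a warehouse table. Everything here is an assumption made for illustration: SQLite stands in for a cloud data warehouse, the `RAW_OBJECTS` list stands in for JSON files in object storage, and the `orders` schema and `load_curated` helper are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw objects as they might sit in low-cost lake storage (JSON lines).
RAW_OBJECTS = [
    '{"order_id": 1, "region": "emea", "amount": 120.5, "notes": {"gift": true}}',
    '{"order_id": 2, "region": "apac", "amount": 80.0}',
    '{"order_id": 3, "region": "emea"}',  # no amount, so it does not fit the schema
]


def load_curated(conn, raw_objects):
    """Persist only the records that fit the predetermined warehouse schema."""
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " region TEXT NOT NULL,"
        " amount REAL NOT NULL)"
    )
    loaded = 0
    for line in raw_objects:
        record = json.loads(line)
        try:
            # Only the curated columns are promoted; extra fields stay in the lake.
            conn.execute(
                "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
                (record.get("order_id"), record.get("region"), record.get("amount")),
            )
            loaded += 1
        except sqlite3.IntegrityError:
            continue  # record violates the schema and is left in the lake only
    conn.commit()
    return loaded


conn = sqlite3.connect(":memory:")
loaded = load_curated(conn, RAW_OBJECTS)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

In this sketch the raw records remain untouched in the lake, while the warehouse table holds only the structured subset that is ready for repeated, high-concurrency SQL queries.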
The term data lakehouse is now used by multiple vendors and organizations to describe an environment in which the functionality associated with data warehousing is integrated into the data lake environment itself. Data lakehouse offerings are available in the form of pre-integrated cloud services, or can be assembled by enterprise architecture teams via a do-it-yourself approach that combines cloud-based data lakes with a variety of (largely open source) structured data processing and analytics acceleration technologies. These technologies provide functionality usually found in a data warehouse, including distributed SQL query engines; support for atomic, consistent, isolated and durable (ACID) transactions; updates and deletes; concurrency control; metadata management; data indexing; data caching; schema enforcement and evolution; query acceleration; semantic models; data governance; version control; and access control and auditing (among other things).
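To make a few of these capabilities concrete, the following Python sketch shows how an append-only commit log can provide all-or-nothing commits, deletes, schema enforcement and versioned reads over otherwise immutable storage. It is a toy under stated assumptions, not the design of any particular lakehouse product or table format: the `ToyLakeTable` class and its methods are hypothetical, the log lives in memory rather than in object storage, and each commit stores a full snapshot where real table formats typically record incremental changes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyLakeTable:
    """Hypothetical in-memory stand-in for a table format on object storage."""

    schema: dict  # column name -> expected Python type
    _log: list = field(default_factory=list)  # append-only list of snapshots

    def _validate(self, row: dict) -> None:
        # Schema enforcement: reject missing, extra or mistyped columns.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for name, expected in self.schema.items():
            if not isinstance(row[name], expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")

    def commit(self, appends=(), delete_where=None) -> int:
        """Apply appends and deletes as one commit; return the new version."""
        appends = list(appends)
        for row in appends:  # validate every row before changing any state
            self._validate(row)
        rows = [r for r in self.snapshot()
                if not (delete_where and delete_where(r))]
        rows.extend(appends)
        # Assumption: a full snapshot per commit, serialized as JSON.
        self._log.append(json.dumps(rows))
        return len(self._log)

    def snapshot(self, version=None) -> list:
        """Read the table as of a version (latest when version is None)."""
        version = len(self._log) if version is None else version
        if version == 0:
            return []
        return json.loads(self._log[version - 1])


table = ToyLakeTable(schema={"id": int, "region": str})
v1 = table.commit(appends=[{"id": 1, "region": "emea"},
                           {"id": 2, "region": "apac"}])
v2 = table.commit(delete_where=lambda r: r["region"] == "apac")
try:
    # One bad row causes the whole commit to be rejected.
    table.commit(appends=[{"id": 3, "region": "amer"},
                          {"id": "bad", "region": "amer"}])
except TypeError:
    pass
```

After these calls the latest snapshot holds only the `emea` row, while `table.snapshot(version=1)` still returns both original rows, which is the kind of version control and auditability the list above refers to. Concurrency control, indexing and caching are deliberately left out of the sketch.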
Such is the nature of the technology industry that we have seen communities emerge that advocate one approach to combining data lake and data warehousing functionality over the other: for example, the data warehouse on data lake versus the data lakehouse, or pre-integrated versus DIY. These communities are often aligned with and led by vendors with vested interests, which often rely on exaggerating both the benefits of one approach and the challenges of the other. As always, one approach is not inherently “better” than the other, and the most appropriate approach for a given organization will depend on several factors, including the nature of its analytic workloads, its existing technology estate and the expertise of its data and IT teams.
We believe it is important to enable discussion that addresses the common goals and overall benefits of combining data warehousing and data lake functionality without getting tied up at an early stage in discriminating between the specific architectural approaches. The hydroanalytic data platform terminology describes a conceptual environment that addresses the convergence of data lake and data warehousing functionality, but also encapsulates multiple approaches to that convergence. I recommend that all organizations — but especially those with data lake investments that are failing to deliver on their potential — consider the evolution of the data lake in the context of the hydroanalytic data platform, analogous to a hydroelectric power station. I also recommend that organizations identify potential business benefits of such an environment before focusing on which of the architectural approaches may be most suitable based on the nature of workloads as well as existing technology investments and expertise.