Matt Aslett's Analyst Perspectives

The Increasing Importance of Open Table Formats

Written by Matt Aslett | Oct 31, 2024 10:00:00 AM

I previously wrote about the importance of open table formats to the evolution of data lakes into data lakehouses. The concept of the data lake was initially proposed as a single environment where data could be combined from multiple sources to be stored and processed to enable analysis by multiple users for multiple purposes.

The first iteration of data lakes, based on Apache Hadoop and then object storage, fulfilled the first part of this equation, providing a relatively inexpensive way of storing large volumes of data from multiple enterprise applications and workloads, especially semi- and unstructured data that is unsuitable for storing and processing in a data warehouse.

It was not until the addition of open table formats—specifically Apache Hudi, Apache Iceberg and Delta Lake—that data lakes truly became capable of supporting multiple business intelligence (BI) projects as well as data science and even operational applications and, in doing so, began to evolve into data lakehouses.

Support for open table formats is now a critical feature for providers of analytic data platforms to enable the persistence and analysis of structured and unstructured data in object storage. Recent announcements have also highlighted the growing importance of open table formats to the persistence of streaming data.

I assert that by 2026, 9 in 10 current data lake adopters will be investing in data lakehouse architecture to improve the business value derived from their accumulated data. This high level of confidence is due in part to the critical importance of open table formats in delivering value from data.

The benefits of open table formats are varied and numerous. Open table formats provide support for atomic, consistent, isolated and durable (ACID) transactions and create, read, update and delete (CRUD) operations on data stored in object storage. Support for ACID transactions and CRUD operations provides consistency and reliability guarantees that give enterprises the confidence that data lakehouse environments are suitable for analytics workloads traditionally associated with data warehouses.
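The mechanics behind these guarantees can be sketched in a few lines. The following is a toy illustration (not the API of Hudi, Iceberg or Delta Lake, whose metadata structures differ) of the copy-on-write, snapshot-based approach the formats share: data files in object storage are immutable, and a commit becomes visible through a single swap of the pointer to a new metadata snapshot.

```python
# Toy sketch of snapshot-based table metadata. Class and file names are
# invented for illustration; real formats track files via manifest and
# log structures, but the atomicity principle is the same.

class ToyTable:
    def __init__(self):
        self._snapshots = [[]]   # each snapshot: a list of immutable data-file names
        self._current = 0        # pointer to the live snapshot

    def scan(self):
        # Readers always see one consistent snapshot (isolation).
        return list(self._snapshots[self._current])

    def commit(self, added=(), removed=()):
        # Build a new snapshot from the current one; existing data files
        # are never modified in place (copy-on-write).
        base = [f for f in self._snapshots[self._current] if f not in set(removed)]
        self._snapshots.append(base + list(added))
        # The commit is a single pointer update: readers observe either
        # the old snapshot or the new one, never a mix (atomicity).
        self._current = len(self._snapshots) - 1

table = ToyTable()
table.commit(added=["part-000.parquet", "part-001.parquet"])            # insert
table.commit(added=["part-002.parquet"], removed=["part-000.parquet"])  # update/delete
print(table.scan())  # ['part-001.parquet', 'part-002.parquet']
```

Because old snapshots are retained, this design also makes time travel (querying the table as of an earlier commit) essentially free, which is one reason the pattern has become standard across all three formats.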

Widespread software provider support for open table formats also enables enterprises to use multiple data platform or SQL query engine products to access and query data in cloud object storage environments, providing flexibility and avoiding over-reliance on a single software provider for both compute and storage.

The three primary open table formats have much in common: they are all developed by open-source foundations; they all write and store data in Apache Parquet files; and all are supported by multiple software providers. They are separate and competing formats with different origins, however. Apache Hudi was created at Uber in 2016 and became an Apache Software Foundation project in 2020, while Apache Iceberg was created at Netflix in 2017 and also became a project of the ASF the following year.

Delta Lake was announced by Databricks in April 2019 and became a project of the Linux Foundation in October 2019, although its development remains dominated by employees of Databricks. Key Apache Hudi developers include employees of Uber, Alibaba and Onehouse, which was founded in 2021 by Vinoth Chandar, the original creator of the Hudi project at Uber, who continues to lead its development under the ASF. Apache Iceberg developers include employees of Netflix, Apple, AWS, Tabular and Dremio.

Competition between formats can be good for innovation but also has the potential to force enterprises to commit to one format over the others. A variety of recent initiatives have increased the potential for interoperability between open table formats. This has also lowered barriers to adoption by reducing concerns about enterprises becoming locked in to a specific format and software provider, as well as about the overall fragmentation of the market.

In 2023, Onehouse announced an initiative to provide interoperability across table formats. Initially called Onetable, the project became Apache XTable in September 2024 and provides a lightweight translation layer to translate metadata between table formats without the need to duplicate or modify the data. The Delta Lake project also introduced interoperability with other table formats in 2023 via Delta Universal Format (or UniForm), which allows data stored in Delta tables to be read as if it were Apache Iceberg or Apache Hudi.
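The key insight behind approaches like XTable and UniForm is that the table formats disagree mainly about metadata, not data: all three reference the same kind of immutable Parquet files. A minimal sketch of that idea follows — the field names are invented for illustration and do not reflect the real Hudi, Iceberg or Delta metadata schemas.

```python
# Hypothetical sketch of metadata translation between table formats.
# The underlying Parquet file list (the actual data) is referenced by
# both metadata representations, never duplicated or modified.

hudi_style_metadata = {
    "table": "events",
    "commits": [
        {"instant": "20240901120000",
         "files": ["part-000.parquet", "part-001.parquet"]},
    ],
}

def translate_to_iceberg_style(meta):
    # Re-express the latest commit as an Iceberg-like snapshot entry.
    latest = meta["commits"][-1]
    return {
        "table": meta["table"],
        "snapshots": [
            {"snapshot-id": latest["instant"],
             "manifest": latest["files"]},   # same file list, not a copy
        ],
    }

iceberg_style = translate_to_iceberg_style(hudi_style_metadata)
# Both metadata layers point at the same physical files:
assert iceberg_style["snapshots"][0]["manifest"] is hudi_style_metadata["commits"][0]["files"]
```

In practice the translation layer runs whenever the source table commits, keeping the target-format metadata in sync so that any engine speaking that format can query the table directly.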

Databricks went one step further in June 2024 by announcing the acquisition of Tabular, a data management provider founded in 2021 by the original creators of Apache Iceberg, adding significant Iceberg expertise to help expand the capabilities of UniForm. Given Databricks’ prior commitment to Delta Lake as a direct alternative to Apache Iceberg, the acquisition of Tabular was a significant change of strategy.

The fact that the Apache Hudi and Delta Lake projects have both delivered interoperability with Apache Iceberg rather than vice versa is illustrative of the level of momentum behind the latter table format. This has also been highlighted recently by moves by streaming data providers—including Cloudera, Confluent, Redpanda and StreamNative—to add support for converting streaming data into Apache Iceberg tables for long-term persistence and analysis of event data. While it is perhaps still too early to crown Apache Iceberg as the open table format champion, there is clear potential for Apache Iceberg to become the lingua franca for storing and analyzing both transactional and event data.

For existing data lake users, the adoption of open table formats is something of a no-brainer given the consistency and reliability guarantees they provide. I assert that by 2026, more than 8 in 10 data lake users will adopt emerging table formats to support transactional consistency when inserting, updating and deleting data in data lake environments. I recommend that all enterprises, whether data lake users or not, evaluate the potential advantages of open table formats in general, and Apache Iceberg in particular, to enable the persistence and analysis of transaction and event data in object storage. I also recommend that, when evaluating software providers, enterprises check whether support is limited to read-only access to data in open table formats or extends to the ability to write data in those formats as well.

Regards,

Matt Aslett