Data observability was a hot topic in 2022 and looks likely to be a continued area of focus for innovation in 2023 and beyond. As I have previously described, data observability software is designed to automate the monitoring of data platforms and data pipelines, as well as the detection and remediation of data quality and data reliability issues. There has been a Cambrian explosion of data observability software vendors in recent years, and while they have fundamental capabilities in common, there is also room for differentiation. One such vendor is Soda Data, which offers an open-source platform for self-service data observability that is focused on facilitating collaboration between business decision-makers and data teams responsible for generating and managing data to improve trust in data.
Headquartered in Brussels, Belgium, Soda was founded in late 2018 by CEO Maarten Masschelein and CTO Tom Baeyens, seasoned executives in the European software sector, with experience at companies including Collibra, JBoss and Alfresco. The company is focused on a perennial data management challenge: reducing the amount of time that organizations spend dealing with data quality and reliability issues. Although manual software products have been used for many years to detect and fix data quality problems, these are no longer efficient given the increasing reliance of organizations on data pipelines to ensure data-driven decision-making. Almost two-thirds of participants (64%) in our Analytics and Data Benchmark Research cited reviewing data for quality issues as being one of the most time-consuming aspects of analytics initiatives, second only to preparing data for analysis. Data observability takes advantage of machine learning (ML) to automate the monitoring and remediation of data quality issues. It is a key element of Data Operations (DataOps), alongside data orchestration. Like all data observability products, Soda’s offering is designed to enable organizations to detect and fix data quality issues. The company has differentiated its offering by focusing on the needs of data consumers, including data analysts and business decision-makers, as well as data producers, including data engineers and IT teams. Soda is particularly focused on midsized customers that are mature in their use of data and are reliant on data and data teams. A prime example of its growing number of customers is HelloFresh. The company has also attracted the interest of investors, raising 11.5 million euros ($13.9 million) in its February 2021 Series A funding round, provided by Singular, Point Nine Capital, Hummingbird Ventures, DCF and angel investors.
While data quality is a persistent data management challenge, Soda’s founders identified it as an increasingly critical problem based on the growing reliance of organizations on data engineering, as well as a lack of tooling available to software engineers to manage, monitor and fix data quality issues. Soda’s founders recognized that while engineers are typically responsible for maintaining data reliability, the tools available to them were primarily focused on testing and monitoring data ingestion and integration pipelines, rather than the quality of the data inputs and outputs. “Garbage in, garbage out” is an age-old concept in computer science. A data pipeline functioning as expected does not provide any guarantees as to whether the data generated by the pipeline can be relied upon for decision-making. As such, we are seeing increased interest in data observability to complement data pipeline orchestration. I assert that through 2025, 6 in 10 organizations will invest in data reliability initiatives to improve trust in data through automated data quality monitoring, alerts and resolution.
Soda’s founders also identified that while data teams are responsible for maintaining data reliability, the arbiters of data quality in any organization are not data engineers but data consumers, such as data analysts and business decision-makers. The company’s approach to data observability, Soda Cloud, is designed to be a self-service platform that empowers and incentivizes everyone in an organization to participate in improving data quality. Soda Cloud provides an environment through which data consumers can set expectations for data quality by defining data quality agreements, take responsibility for automatically generated alerts related to data quality issues, and investigate and report incidents. Data producers can use Soda Cloud to prioritize and resolve incidents using root-cause analysis functionality, as well as take steps to prevent repetition by implementing circuit breakers within data pipelines. These are checks that stop the data pipeline in the event of failure until the related data has been reviewed.
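To make the circuit-breaker idea concrete, here is a minimal sketch in plain Python. It does not use Soda's API: the names `DataQualityError`, `circuit_breaker`, `run_pipeline` and the two example checks are hypothetical, chosen only to illustrate how a failed quality check can halt a pipeline before bad data reaches its destination.

```python
class DataQualityError(Exception):
    """Raised when a quality check fails and the pipeline must stop."""


def circuit_breaker(rows, checks):
    """Run every check against the batch; raise if any check fails."""
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise DataQualityError(
            f"Pipeline halted; failed checks: {', '.join(failures)}"
        )
    return rows


def run_pipeline(rows, checks, load):
    """Validate the batch first, then hand it to the load step."""
    validated = circuit_breaker(rows, checks)
    load(validated)


# Hypothetical checks for an orders dataset (illustrative only).
CHECKS = {
    "batch is not empty": lambda rows: len(rows) > 0,
    "no missing amounts": lambda rows: all(r.get("amount") is not None for r in rows),
}

loaded = []  # stands in for the downstream table
good = [{"order_id": 1, "amount": 9.5}]
bad = [{"order_id": 2, "amount": None}]

run_pipeline(good, CHECKS, loaded.extend)  # passes the checks and is loaded

halted_reason = None
try:
    run_pipeline(bad, CHECKS, loaded.extend)  # fails a check, nothing is loaded
except DataQualityError as err:
    halted_reason = str(err)
```

In this sketch the failing batch never reaches the load step, which mirrors the behavior described above: the pipeline stays stopped until someone has reviewed the data in question.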
Specifically, Soda Cloud offers a low-code environment for data consumers to define data quality agreements and write data quality checks using SodaCL (Soda Checks Language), a human-readable, domain-specific language for data quality management. Another key element of Soda Cloud is Soda Core, an open-source command-line tool that connects to and scans the source data platforms (Amazon Athena, Amazon Redshift, Apache Spark, Databricks, Google BigQuery, PostgreSQL and Snowflake). Soda Core converts the quality checks written in SodaCL into SQL queries that are executed against the relevant datasets to identify invalid, missing or unexpected data. Soda Core is also responsible for sending the metadata related to issues identified by data quality checks to Soda Cloud, where it can be monitored and reviewed by users. Soda also provides Soda Agent, a container environment with an instance of Soda Core that can be installed in a customer’s own cloud environment. Soda Cloud provides integration with data catalogs (Alation, Amundsen, Collibra and Metaphor), data orchestration tools (Apache Airflow, Dagster, dbt and Prefect), incident management tools (Jira, Opsgenie, PagerDuty and ServiceNow), business intelligence dashboards (Google Looker, Microsoft Power BI and Salesforce Tableau), and collaborative communication applications (Microsoft Teams and Salesforce Slack). Licensing options for Soda Cloud include Soda Team for small groups, as well as Soda Enterprise, which provides availability for all users in an organization.
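The general pattern of turning declarative quality checks into SQL queries can be sketched as follows. This is an illustration only, not SodaCL syntax and not how Soda Core is implemented: the check format, the `check_to_sql` and `run_checks` functions, and the `customers` table are all assumptions, and an in-memory SQLite database stands in for a real data platform.

```python
import sqlite3

# Hypothetical declarative checks. This is NOT SodaCL, only an illustration.
CHECKS = [
    {"name": "no missing emails", "metric": "missing_count", "column": "email", "max": 0},
    {"name": "table is not empty", "metric": "row_count", "min": 1},
]


def check_to_sql(table, check):
    """Translate one declarative check into a SQL aggregate query."""
    if check["metric"] == "row_count":
        return f"SELECT COUNT(*) FROM {table}"
    if check["metric"] == "missing_count":
        return f"SELECT COUNT(*) FROM {table} WHERE {check['column']} IS NULL"
    raise ValueError(f"Unsupported metric: {check['metric']}")


def run_checks(conn, table, checks):
    """Execute each generated query and compare the result with its thresholds."""
    results = {}
    for check in checks:
        value = conn.execute(check_to_sql(table, check)).fetchone()[0]
        lower = check.get("min", float("-inf"))
        upper = check.get("max", float("inf"))
        results[check["name"]] = {"value": value, "passed": lower <= value <= upper}
    return results


# Small in-memory dataset with one missing email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None)],
)

results = run_checks(conn, "customers", CHECKS)
```

Here the missing-value check fails because one row has no email, while the row-count check passes. In a real deployment, results like these are the kind of metadata that would be forwarded to a central service for monitoring and review.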
We are still at the early stages of adoption of data observability technology, and while customer interest is growing, driven by an increased focus on data reliability as well as agile, automated DataOps tooling, so is the number of competing vendors. Soda’s focus on facilitating agreements between data consumers and data producers is a differentiator and reflects the importance of data users in identifying data quality concerns. While SodaCL is human-readable, there is the potential to lower barriers to adoption for data consumers with a more visual no-code approach. I recommend that organizations exploring approaches to improving data reliability evaluate the emerging data observability providers, including Soda, to understand how they can facilitate greater trust in data and accelerate data-driven business decisions.