Matt Aslett's Analyst Perspectives

Monte Carlo Addresses Performance as a Measure of Data Reliability

Written by Matt Aslett | Jun 27, 2024 10:00:00 AM

I recently wrote about the role data observability plays in generating value from data by providing an environment for monitoring its quality and reliability. Data observability is a critical functional aspect of Data Operations, alongside the development, testing and deployment of data pipelines and data orchestration, as I explained in our Data Observability Buyers Guide. Maintaining data quality and trust is a perennial data management challenge, often preventing organizations from operating at the speed of business. A myriad of new software providers, including Monte Carlo, have emerged in recent years with products designed to address this challenge through automation.

Monte Carlo was founded in 2019 to provide data engineers with tools for monitoring the validity of data pipelines, similar to those that enable IT engineers to identify and resolve software and infrastructure failures and performance problems. Like other early data observability pioneers, Monte Carlo’s founders were inspired by the observability platforms that give software and infrastructure engineers an environment for monitoring metrics, traces and logs to track application and infrastructure performance, and set out to create an equivalent for monitoring the quality and reliability of data used for analytics and governance projects. The company invested in automation to handle the growing range of data sources and the volume of data involved in data-driven decision-making, differentiating itself from traditionally manual data quality tools.

As I previously explained, while data quality software is concerned with the suitability of data to a given task, data observability is concerned with the reliability and health of the overall data environment. Data observability tools monitor not just the data in an individual environment for a specific purpose at a given point in time, but also the associated upstream and downstream data pipelines. By doing so, data observability software helps ensure that data is available and up to date, avoiding downtime caused by lost or inaccurate data resulting from schema changes, system failures or broken data pipelines. While data quality software is designed to help users identify and resolve problems with the validity of the data itself, data observability software is designed to automate the detection of data quality problems and the identification of their causes. As such, data observability can potentially enable users to prevent data quality issues before they occur.
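To make the distinction concrete, the following sketch contrasts a row-level data quality check with an observability-style check on table freshness and volume. This is a minimal illustration in Python under assumed inputs (the thresholds, field names and metadata shape are hypothetical), not a depiction of Monte Carlo’s actual product or API:

```python
from datetime import datetime, timedelta, timezone

def data_quality_check(rows):
    """Data quality: is each individual record valid for its task?
    (The 'email' field is a hypothetical example.)"""
    return [r for r in rows if not r.get("email")]  # invalid rows

def observability_check(table_meta, recent_row_counts):
    """Data observability: is the table itself fresh and arriving at the
    expected volume, regardless of any individual record's content?"""
    issues = []
    # Freshness: has the table been updated recently? (threshold assumed)
    age = datetime.now(timezone.utc) - table_meta["last_updated"]
    if age > timedelta(hours=6):
        issues.append(f"stale: last updated {age} ago")
    # Volume: does the latest row count deviate sharply from recent history?
    expected = sum(recent_row_counts) / len(recent_row_counts)
    if table_meta["row_count"] < 0.5 * expected:
        issues.append("volume anomaly: row count well below recent average")
    return issues
```

In practice, a data observability platform derives checks like these automatically from metadata, query logs and lineage rather than requiring them to be hand-written for each table.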

Monte Carlo initially focused primarily on issues related to the nature of the data itself (its freshness, quality, volume, schema and lineage) but is increasingly turning its attention to the additional dimension of data pipeline performance. After all, the data generated by a pipeline might be valid, but if the pipeline is performing below expectations, users’ trust in the output is still likely to suffer. Pipeline delays are especially significant given that almost two-thirds (64%) of participants in Ventana Research’s Analytics and Data Benchmark Research cite reviewing data for quality and consistency issues as the most time-consuming task in analyzing data. Monte Carlo added pipeline performance monitoring as a key feature of its data observability platform in August 2023, enabling users to detect anomalies, identify the root causes of performance problems, and evaluate the impact of changes to code, data and software configuration.
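As a rough illustration of how performance monitoring of this kind can work, the sketch below flags a pipeline run whose duration deviates sharply from its recent history. This is a generic z-score approach with hypothetical names and thresholds, not a description of Monte Carlo’s implementation:

```python
import statistics

def runtime_is_anomalous(history_secs, latest_secs, z_threshold=3.0):
    """Flag a pipeline run whose duration is a statistical outlier
    relative to recent runs (generic z-score test, assumed threshold)."""
    mean = statistics.mean(history_secs)
    stdev = statistics.stdev(history_secs)
    if stdev == 0:
        return latest_secs != mean
    return abs(latest_secs - mean) / stdev > z_threshold

# Example: nightly runs usually take about 10 minutes; today's took 40.
recent = [590, 610, 605, 620, 598, 612, 601]
print(runtime_is_anomalous(recent, 2400))  # True: investigate this run
```

Detecting the anomaly is only the first step; root-cause analysis then correlates the slow run with recent changes to code, data or configuration.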

Monte Carlo has also recently announced capabilities that take data observability closer to the origin of problems. In April 2023, Monte Carlo introduced integration with Fivetran’s data movement and transformation software, enabling joint users of Monte Carlo and Fivetran to add monitoring to data pipelines at the point of creation. The company already offers integration with data orchestration technologies, including Apache Airflow, dbt Core and dbt Cloud, and in March it added the ability to explore the lineage of data orchestration workflows generated using Apache Airflow and Databricks.

March also saw the introduction of Data Explorer, which provides interactive exploration of data profile metrics, while the Data Product Dashboard added to the platform gives data teams the ability to define the various data assets (such as tables, reports, dashboards and models) that are combined into data products. As I previously explained, data products are the outcome of applying product thinking to datasets to ensure that they can be discovered and consumed by others. I assert that by 2027, more than 6 in 10 enterprises will adopt technologies to facilitate the delivery of data as a product as they adapt their cultural and organizational approaches to data ownership in the context of data mesh.
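To illustrate what defining a data product as a collection of assets might look like, the sketch below uses a simple Python structure. The schema, names and fields are entirely hypothetical and do not represent Monte Carlo’s Data Product Dashboard:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical sketch: a data product as a named, owned collection
    of assets whose health can be monitored and reported as a unit."""
    name: str
    owner: str
    assets: list[str] = field(default_factory=list)

churn_product = DataProduct(
    name="customer_churn",
    owner="analytics-team@example.com",
    assets=[
        "warehouse.customers",   # upstream source table
        "dbt.churn_features",    # transformation model
        "bi.churn_dashboard",    # downstream report
    ],
)
```

Grouping assets this way lets reliability be tracked and communicated at the level consumers care about: the product, rather than the individual table.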

The capabilities Monte Carlo has introduced in recent months reflect a drive to improve trust in data by making data observability a core component of any data strategy. Enterprises that embrace data observability have the potential to improve the quality of data as it is generated and processed, rather than checking for quality problems after the event. I recommend that any enterprise looking to improve trust in data, and so enable greater adoption of data-driven decision-making, evaluate the potential of data observability software and include Monte Carlo in its assessments.

Regards,

Matt Aslett