Almost all organizations are investing in data science, or planning to, encouraging experimentation and exploration to identify new business challenges and opportunities as part of the drive toward a more data-driven culture. My colleague, David Menninger, has written about how organizations using artificial intelligence and machine learning (AI/ML) report gaining competitive advantage, improving customer experiences, responding faster to opportunities and threats, and improving the bottom line through increased sales and lower costs. One-quarter of participants (25%) in Ventana Research’s Analytics and Data Benchmark Research are already using AI/ML, more than one-third (34%) plan to do so in the next year, and more than one-quarter (28%) plan to do so eventually.

As organizations adopt data science and expand their analytics initiatives, they face no shortage of options for AI/ML capabilities, and choosing the most appropriate approach could be the difference between success and failure. The major cloud providers all offer AI/ML services, including general-purpose ML environments as well as dedicated services for specific use cases, such as image detection or language translation. Software vendors also provide a range of products, both on-premises and in the cloud, including general-purpose ML platforms and specialist applications. Meanwhile, analytic data platform providers are increasingly adding ML capabilities to their offerings to provide additional value to customers and differentiate themselves from competitors. There is no simple answer as to which approach is best, but it is worth weighing the relative benefits and challenges. From the perspective of our analytic data platform expertise, the key choice is between AI/ML capabilities provided on a standalone basis and those integrated into a larger data platform.
I have previously written about growing interest in the data lakehouse as one of the design patterns for delivering hydroanalytics, the analysis of data in a data lake. Many organizations have invested in data lakes as a relatively inexpensive way of storing large volumes of data from multiple enterprise applications and workloads, especially the semi-structured and unstructured data that is unsuitable for storing and processing in a data warehouse. However, early data lake projects lacked the structured data management and processing functionality needed to support multiple business intelligence efforts as well as data science and even operational applications.
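To illustrate what that structured data-processing layer adds, here is a minimal sketch, assuming Apache Spark with the open source Delta Lake table format (one of several open formats used in lakehouse implementations); the storage paths and table contents are hypothetical.

```python
# Minimal lakehouse-style sketch: raw, semi-structured files in a data lake are
# persisted as a transactional table in an open format and then queried with SQL.
# Assumes Spark with the delta-spark package available; paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw, semi-structured events landed in cloud object storage (the data lake).
raw = spark.read.json("s3a://example-lake/raw/events/")

# Persist them as a governed, ACID table in an open format (the lakehouse layer).
raw.write.format("delta").mode("overwrite").save("s3a://example-lake/curated/events")

# Structured, SQL-based processing over the same underlying storage.
spark.read.format("delta").load("s3a://example-lake/curated/events") \
    .createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS total FROM events GROUP BY event_type").show()
```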
I have written recently about the similarities and differences between data mesh and data fabric. The two are potentially complementary. Data mesh is an organizational and cultural approach to data ownership, access and governance. Data fabric is a technical approach to automating data management and data governance in a distributed architecture. There are various definitions of data fabric, but key elements include a data catalog for metadata-driven data governance, along with self-service, agile data integration.
In the pursuit of becoming data-driven, organizations are collecting and managing more data than ever before as they attempt to gain competitive advantage and respond faster to worker and customer demands for more innovative, data-rich applications and personalized experiences. As data is increasingly spread across multiple data centers, clouds and regions, organizations need to manage data on multiple systems in different locations and bring it together for analysis. As data volumes increase and more data sources and data types are introduced, organizations face challenges in storing, managing, connecting and analyzing information that is spread across multiple locations. A strong foundation and a scalable data management architecture can alleviate many of the challenges organizations face as they scale and add infrastructure. We have written about the potential for hybrid and multi-cloud platforms to safeguard data across heterogeneous environments, which plays to the strengths of companies, such as Actian, that provide a single environment with the ability to integrate, manage and process data across multiple locations.
I have written a few times in recent months about vendors offering functionality that addresses data orchestration. The concept has grown in popularity over the past five years amid the rise of Data Operations (DataOps), which describes more agile approaches to data integration and data management. In a nutshell, data orchestration is the process of combining data from multiple operational data sources and preparing and transforming it for analysis. To those unfamiliar with the term, this may sound very much like the tasks that data management practitioners have been undertaking for decades. As such, it is fair to ask what separates data orchestration from traditional approaches to data management. Is it really something new that can deliver innovation and business value, or just the rebranding of existing practices designed to drive demand for products and services?
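For readers unfamiliar with the term, a minimal sketch of the pattern data orchestration describes may help: extract data from multiple operational sources, load it into an analytic store, then transform it into analysis-ready tables. The sources, connection details and table names below are hypothetical, and a real implementation would run as a scheduled, dependency-aware workflow rather than a linear script.

```python
# Illustrative extract-load-transform (ELT) sketch using pandas and SQLite.
# Source systems, file paths and table names are hypothetical placeholders.
import sqlite3
import pandas as pd

def extract() -> dict[str, pd.DataFrame]:
    """Extract raw data from two (hypothetical) operational sources."""
    orders = pd.read_sql("SELECT * FROM orders", sqlite3.connect("orders.db"))
    customers = pd.read_json("exports/customers.json")
    return {"orders": orders, "customers": customers}

def load(raw: dict[str, pd.DataFrame], warehouse: sqlite3.Connection) -> None:
    """Load raw extracts into staging tables before transformation."""
    for name, frame in raw.items():
        frame.to_sql(f"stg_{name}", warehouse, if_exists="replace", index=False)

def transform(warehouse: sqlite3.Connection) -> None:
    """Prepare an analysis-ready table by joining the staged sources."""
    warehouse.execute("""
        CREATE TABLE IF NOT EXISTS orders_by_customer AS
        SELECT c.customer_id, c.region, COUNT(o.order_id) AS order_count
        FROM stg_customers c LEFT JOIN stg_orders o USING (customer_id)
        GROUP BY c.customer_id, c.region
    """)

if __name__ == "__main__":
    warehouse = sqlite3.connect("analytics.db")
    load(extract(), warehouse)
    transform(warehouse)
```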
Ventana Research’s Data Lakes Dynamics Insights research illustrates that while data lakes are fulfilling their promise of enabling organizations to economically store and process large volumes of raw data, data lake environments continue to evolve. Data lakes were initially based primarily on Apache Hadoop deployed on-premises but are now increasingly based on cloud object storage. Adopters are also shifting from data lakes based on homegrown scripts and code to open standards and open formats, and they are beginning to embrace the structured data-processing functionality that supports data lakehouse capabilities. These trends are driving the evolution of vendor product offerings and strategies, as typified by Cloudera’s recent launch of Cloudera Data Platform (CDP) One, described as a data lakehouse software-as-a-service (SaaS) offering.
I have written before about the continued use of specialist operational and analytic data platforms. Most database products can be used for operational or analytic workloads, and the number of use cases for hybrid data processing is growing. However, a general-purpose database is unlikely to meet the most demanding operational or analytic data platform requirements. Factors including performance, reliability, security and scalability necessitate the use of specialist data platforms. I assert that through 2026, and despite increased demand for hybrid operational and analytic processing, more than three-quarters of data platform use cases will have functional requirements that encourage the use of specialized analytic or operational data platforms. It is for that reason that specialist database providers, including Ocient, continue to emerge with new and innovative approaches targeted at specific data-processing requirements.
Earlier this year I described the growing use cases for hybrid data processing. Although it is anticipated that the majority of database workloads will continue to be served by specialist data platforms targeting operational and analytic workloads respectively, there is increased demand for intelligent operational applications infused with the results of analytic processes, such as personalization and artificial intelligence-driven recommendations. There are multiple data platform approaches to delivering real-time data processing and analytics, including streaming data and event processing, as well as specialist real-time analytic data platforms. We also see operational data platform providers, such as Aerospike, adding analytic processing capabilities to support these application requirements via hybrid operational and analytic processing.
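As an illustration of that application pattern (rather than of any specific Aerospike analytic feature), here is a minimal sketch using the Aerospike Python client, in which an upstream analytic process writes precomputed recommendations into the same record the operational application reads at request time; the namespace, set and bin names are hypothetical.

```python
# Minimal sketch: an operational read path serving analytics-derived
# recommendations alongside the customer record. Namespace, set and bin
# names are hypothetical; assumes a local Aerospike server and the
# aerospike Python client package.
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
key = ("app", "customers", "customer-42")

# An upstream analytic process writes its results (e.g., ranked product
# recommendations) into the record the operational application reads.
client.put(key, {
    "name": "Jane Example",
    "segment": "frequent-buyer",
    "recs": ["sku-101", "sku-204", "sku-007"],
})

# The operational application reads the record at request time and can
# personalize the response without issuing a separate analytic query.
_, _, record = client.get(key)
print(record["name"], record["recs"])
```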
I have recently written about the organizational and cultural aspects of being data-driven, and the potential advantages data-driven organizations stand to gain by responding faster to worker and customer demands for more innovative, data-rich applications and personalized experiences. I have also explained that data-driven processes require more agile, continuous data processing, with an increased focus on extract, load and transform processes, as well as change data capture, automation and orchestration, as part of a DataOps approach to data management. Safeguarding the health of data pipelines is fundamental to ensuring data is integrated and processed in the sequence required to generate business intelligence. The significance of these data pipelines to delivering data-driven business strategies has led to the emergence of vendors, such as Astronomer, focused on enabling organizations to orchestrate data engineering pipelines and workflows.
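To make the orchestration aspect concrete, here is a minimal sketch of a pipeline expressed as an Apache Airflow DAG, the open source project on which Astronomer's platform is based. The tasks are hypothetical placeholders for extract-and-load, change data capture and transformation steps, and the sketch assumes Airflow 2.4 or later.

```python
# Illustrative Airflow DAG (assumes Airflow 2.4+). Each task is a placeholder
# for a real extract-and-load, change data capture or transformation step;
# in practice each would call out to external systems.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def example_elt_pipeline():

    @task
    def extract_and_load() -> str:
        # Land raw data from an operational source in the warehouse or lake.
        return "stg_orders"

    @task
    def apply_cdc(staging_table: str) -> str:
        # Merge captured changes so staged data stays in step with the source.
        return staging_table

    @task
    def build_models(staging_table: str) -> None:
        # Transform staged data into analysis-ready tables for BI.
        print(f"transforming {staging_table}")

    build_models(apply_cdc(extract_and_load()))


example_elt_pipeline()
```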
The data catalog has become an integral component of organizational data strategies over the past decade, serving as a conduit for good data governance and facilitating self-service analytics initiatives. The data catalog has become so important, in fact, that it is easy to forget that just 10 years ago it did not exist as a standalone product category. Metadata-based data management functionality has had a role to play within products for data governance and business intelligence for much longer than that, of course, but the emergence of the data catalog as a product category provided a platform for metadata-based data inventory and discovery that could span an entire organization, serving multiple departments, use cases and initiatives.