Data observability refers to an organization’s comprehensive understanding of the health and performance of the data within its systems.
Data observability tools employ automated monitoring, root cause analysis, data lineage, and data health insights to proactively detect, resolve, and prevent data anomalies. This relatively new technology category has been quickly adopted by data teams, in part due to its extensibility (here are 61 use cases it supports).
But perhaps one of the greatest advantages of a data observability platform is its time to value. Unlike data testing or modern data quality platforms, data observability solutions require minimal configuration or manual threshold setting. They use machine learning monitors to learn how your data behaves, typically over a period of less than 2 weeks, and then alert you to relevant data incidents.
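Under the hood, these monitors generally amount to learned statistical baselines rather than hand-set thresholds. Below is a deliberately minimal sketch of the idea, using a plain z-score over observed update gaps; real platforms use far more sophisticated models with seasonality handling, and every number here is illustrative:

```python
import statistics

def freshness_anomaly(learned_gaps_minutes, latest_gap, z_threshold=3.0):
    """Flag a table as stale if the latest update gap is an outlier
    versus the gaps observed during the learning period."""
    mean = statistics.mean(learned_gaps_minutes)
    stdev = statistics.stdev(learned_gaps_minutes)
    # A gap more than z_threshold standard deviations above the learned
    # mean is treated as a freshness incident.
    return latest_gap > mean + z_threshold * stdev

# Update gaps (minutes) observed over the learning period -- illustrative.
learned = [58, 61, 60, 65, 59, 62, 63, 60, 61, 64]
print(freshness_anomaly(learned, latest_gap=240))  # True: likely stale
```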
Despite the ease of integration and setup, there are some best practices for implementing data observability. They can be quickly summarized as:
- Crawl: Initiate basic monitors for data freshness, volume, and schema across your system, and start building incident response capabilities by handling and resolving incidents (a minimal sketch of these checks appears after this list).
- Walk: Implement field health monitors and customize monitors for critical tables to detect data quality issues. Define and communicate data pipeline SLAs with data consumers to establish trust in the data.
- Run: Prioritize the prevention of data quality issues using insights and dashboards on data health. Expand support to additional areas like MLOps engineering and beyond.
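To make the crawl stage concrete, here is a minimal sketch of schema and volume checks. The function names and tolerances are invented for illustration; an observability platform would learn thresholds like these rather than hard-code them:

```python
def diff_schema(previous_columns, current_columns):
    """Compare two schema snapshots and report drift."""
    prev, curr = set(previous_columns), set(current_columns)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}

def volume_anomaly(recent_row_counts, latest_count, tolerance=0.5):
    """Flag the latest load if it deviates from the recent average
    by more than the given fraction."""
    avg = sum(recent_row_counts) / len(recent_row_counts)
    return abs(latest_count - avg) / avg > tolerance

print(diff_schema(["id", "email"], ["id", "email", "signup_source"]))
# {'added': ['signup_source'], 'removed': []}
print(volume_anomaly([10_000, 10_400, 9_900], latest_count=2_000))  # True
```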
Implementing data observability effectively requires attention to detail. Here are six steps, tried and tested by numerous data teams, for mastering data observability implementation.
Step 1: Inventory Data Use Cases (Pre-Implementation)
One of the initial steps in implementing data observability is to evaluate your existing and upcoming data use cases, categorizing them into three main types:
- Analytical: Data primarily utilized for decision-making or assessing business strategies via BI dashboards.
- Operational: Data directly supporting business operations in near-real-time, often involving streaming or micro-batch data. Examples include customer support interactions or ecommerce recommendation algorithms.
- Customer-facing: Data integrated into or enhancing the product offering, or data serving as the product itself. For instance, a reporting suite within a digital advertising platform.
This categorization is vital because data quality requirements vary depending on the context. Some scenarios, like financial reporting, demand utmost accuracy, while others, like certain machine learning applications, prioritize data freshness over absolute precision.
Checkout.com’s Senior Data Engineer, Martynas Matimaitis, notes, “Given our presence in the financial sector, we encounter diverse use cases for both analytical and operational reporting that necessitate high accuracy levels.” This led Checkout.com to prioritize data quality management early in its journey, making it integral to daily operations.
The subsequent step involves evaluating the overall performance of your systems and team. At the outset, detailed insights into data health and operations might be lacking. However, you can use both quantitative and qualitative indicators:
- Quantitative: Measure metrics like data consumer complaints, overall data adoption rates, and levels of data trust (e.g., through NPS surveys). Additionally, estimate the time spent by the team on data quality management tasks such as maintaining tests and resolving incidents.
- Qualitative: Assess factors such as the appetite for advanced data use cases, whether leaders feel they’ve fully harnessed the organization’s data, the presence of a data-driven culture, and any recent data quality incidents prompting senior-level attention.
Categorizing data use cases and establishing performance baselines aids in identifying gaps between the current and desired future states across infrastructure, team dynamics, processes, and performance.
Step 2: Rally And Align The Organization (Pre-Implementation)
Once you’ve established a baseline, the next step in advancing your data observability initiative is to garner support from various stakeholders. Understanding the pain points experienced by different parties is crucial.
If no evident pain exists, it’s essential to investigate why. It’s possible that either the scale of your data operations or the perceived importance of your data isn’t significant enough to justify investing in data quality improvement through observability. However, this scenario is unlikely if you manage more than 50 tables or if your data team routinely handles data quality issues.
More likely, your organization harbors unrealized risks. While data quality may be satisfactory today, the potential for costly incidents looms. Data consumers typically trust the data until they’re given a reason not to, and once that trust is lost, it is difficult to rebuild.
Assessing the overall risk of poor data quality is complex. Consequences can vary from slightly suboptimal decision-making to reporting inaccurate data to stakeholders like Wall Street.
One approach is to quantify this risk by estimating data downtime and attaching an inefficiency cost to it. Alternatively, industry benchmarks can provide insights – studies suggest that bad data can impact, on average, 31% of a company’s revenue.
This risk assessment, along with the costs incurred by business stakeholders dealing with poor data, offers valuable insights, albeit somewhat imprecise. It should also consider the opportunity cost of excessive data engineering hours spent addressing data quality issues.
By tallying the time spent on data quality tasks and multiplying it by the average data engineering salary, you can gauge the business case for data observability implementation.
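Here is a back-of-the-envelope version of that business case, combining the data downtime framing (incidents × (time-to-detection + time-to-resolution)) with the labor cost described above. Every number is a placeholder to replace with your own:

```python
# Illustrative inputs -- substitute your own figures.
incidents_per_month = 15
hours_to_detect = 4        # average time-to-detection per incident
hours_to_resolve = 9       # average time-to-resolution per incident
engineers_involved = 2     # engineers pulled in per incident
hourly_rate = 75           # ~$150k fully loaded salary / 2,000 hours

monthly_downtime_hours = incidents_per_month * (hours_to_detect + hours_to_resolve)
monthly_labor_cost = monthly_downtime_hours * engineers_involved * hourly_rate

print(f"Data downtime: {monthly_downtime_hours} hours/month")   # 195 hours/month
print(f"Engineering cost: ${monthly_labor_cost:,.0f}/month")    # $29,250/month
```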
Once you’ve obtained a mandate and decided whether to build a machine learning-based data monitoring solution in-house or adopt a data observability platform, it’s time to proceed with implementation and scaling.
Step 3: Implement Broad Data Quality Monitoring
In the third phase of implementing data observability, it’s crucial to deploy basic machine learning monitors (freshness, volume, and schema) across your entire data environment. Rather than piloting and gradually scaling, it’s advisable, for all but the largest enterprises, to roll out these monitors across every data product, domain, and department.
This broad implementation approach accelerates the time to value and establishes essential connections with different teams if not already established.
Another rationale for a widespread rollout is that data, even in decentralized organizations, is interdependent. Installing fire suppression systems in the living room while a fire rages in the kitchen doesn’t offer much benefit.
Moreover, implementing wide-scale data monitoring or data observability provides a comprehensive view of the data environment and its overall health. Having this holistic perspective is invaluable as you progress to the next stage of your data observability implementation.
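As a sketch of what a broad rollout means in practice: enumerate every table and attach the same default monitors to each, rather than hand-picking a pilot set. The `register` callback below stands in for whatever API your platform or in-house tooling actually exposes:

```python
DEFAULT_MONITORS = ("freshness", "volume", "schema")

def register_default_monitors(tables, register):
    """Attach the basic monitors to every table in the warehouse,
    not just a hand-picked pilot set."""
    for table in tables:
        for monitor in DEFAULT_MONITORS:
            register(table, monitor)

# Illustrative table list; in practice, enumerate from the catalog.
register_default_monitors(
    tables=["analytics.orders", "analytics.users", "marketing.campaigns"],
    register=lambda table, monitor: print(f"enabled {monitor} on {table}"),
)
```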
Step 4: Optimize Incident Resolution
At this phase, the focus shifts to making incident triage and resolution more efficient. This entails establishing clear lines of ownership within the organization. Designating team owners for data quality, as well as overall data asset owners at both the data product and data pipeline levels, is essential.
If not already done, dividing the environment into domains can further enhance accountability and transparency regarding the overall data health maintained by different groups.
Clear ownership facilitates fine-tuning alert settings, ensuring they are directed to the appropriate communication channels of the responsible team and escalated to the appropriate level when necessary.
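A simple illustration of that routing logic, assuming domain-prefixed table names; the ownership map, channel names, and escalation rule are all placeholders:

```python
# Hypothetical ownership map: domain -> owning team's alert channel.
ROUTING = {
    "finance": "#data-alerts-finance",
    "marketing": "#data-alerts-marketing",
}
FALLBACK_CHANNEL = "#data-alerts"

def route_alert(incident, send):
    """Send the alert to the owning team's channel, and escalate
    high-severity incidents to the central channel as well."""
    domain = incident["table"].split(".")[0]
    send(ROUTING.get(domain, FALLBACK_CHANNEL), incident)
    if incident["severity"] == "high":
        send(FALLBACK_CHANNEL, incident)

route_alert(
    {"table": "finance.revenue_daily", "severity": "high"},
    send=lambda channel, inc: print(channel, inc["table"]),
)
```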
Step 5: Create Custom Data Quality Monitors
The next step involves implementing more advanced, tailored monitors. These can either be manually defined, such as setting specific freshness requirements for critical data needed by an executive, or machine learning-based. In the latter case, designated tables or data segments are highlighted for examination, with machine learning alerts triggered when anomalies are detected.
It’s advisable to apply custom monitors to the organization’s most crucial data assets, typically those with numerous downstream consumers or significant dependencies. Additionally, custom monitors and service level agreements (SLAs) can be established for different data reliability tiers to manage expectations. For instance, datasets can be certified as “gold” for high reliability or labeled as “bronze” for less robust support.
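One way to encode those tiers is as an explicit policy that custom monitors consult, so “gold” assets are held to stricter thresholds than “bronze” ones. The thresholds below are invented for illustration:

```python
# Hypothetical tier policy: stricter SLAs for higher-reliability tiers.
TIER_POLICY = {
    "gold":   {"max_null_rate": 0.001, "max_staleness_hours": 1},
    "silver": {"max_null_rate": 0.01,  "max_staleness_hours": 6},
    "bronze": {"max_null_rate": 0.05,  "max_staleness_hours": 24},
}

def null_rate_breach(null_count, row_count, tier):
    """Return True if a field's null rate breaches its tier's SLA."""
    return (null_count / row_count) > TIER_POLICY[tier]["max_null_rate"]

print(null_rate_breach(null_count=42, row_count=10_000, tier="gold"))    # True
print(null_rate_breach(null_count=42, row_count=10_000, tier="bronze"))  # False
```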
Leading organizations often manage a substantial portion of their custom data observability monitors through code, integrating them into the continuous integration/continuous deployment (CI/CD) process. This approach streamlines deployment and scalability.
By incorporating monitoring logic into deployment pipelines, organizations like Checkout.com have minimized reliance on manual monitors and tests. They’ve integrated monitoring logic into their code repository, aligning it with data pipelines and facilitating platform harmonization and scalability. This centralized approach also simplifies issue identification and resolution, accelerating time to resolution.
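A minimal sketch of monitors as code: definitions live in the repository next to the pipelines they watch, and a validation step runs in CI so a malformed monitor fails the build rather than failing silently in production. The definition schema here is invented for illustration:

```python
# monitors.py -- version-controlled alongside the pipelines it watches.
MONITORS = [
    {"table": "finance.revenue_daily", "type": "freshness", "max_hours": 2},
    {"table": "finance.revenue_daily", "type": "volume", "min_rows": 1_000},
]

def validate(monitors):
    """Run in CI: a malformed definition fails the build, not production."""
    required = {"table", "type"}
    for monitor in monitors:
        missing = required - monitor.keys()
        assert not missing, f"monitor {monitor} is missing {missing}"

if __name__ == "__main__":
    validate(MONITORS)
    print(f"{len(MONITORS)} monitor definitions valid")
```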
Step 6: Incident Prevention
At this stage of our data observability implementation, we’ve delivered substantial value to the business and notably enhanced data quality. While our efforts have significantly reduced time-to-detection and time-to-resolution, there’s another crucial factor in the equation: the number of data incidents.
In essence, one of the final steps in implementing data observability effectively is to proactively prevent data incidents before they occur.
This involves focusing on data health insights, such as identifying unused tables or deteriorating queries. Analyzing and reporting on data reliability levels or SLA adherence across different domains helps data leaders allocate resources effectively within their data quality management program.
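As a small example of one such insight, the sketch below flags tables nobody has queried recently, which are candidates for deprecation before they silently go stale. In practice, the last-accessed dates would come from your warehouse’s query logs:

```python
from datetime import date, timedelta

def unused_tables(last_accessed, as_of, threshold_days=90):
    """Return tables not queried in the last `threshold_days` days."""
    cutoff = as_of - timedelta(days=threshold_days)
    return sorted(t for t, last in last_accessed.items() if last < cutoff)

# Illustrative last-accessed dates, e.g. derived from query logs.
print(unused_tables(
    {"analytics.orders": date(2024, 5, 1), "legacy.sessions_v1": date(2023, 9, 12)},
    as_of=date(2024, 6, 1),
))  # ['legacy.sessions_v1']
```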
Final thoughts
Throughout this article, we’ve explored various aspects of implementing data observability. Here are some key takeaways:
- Ensure comprehensive monitoring of both the data pipeline and the data it transports.
- Develop a business case for data monitoring by understanding the time spent by your team on pipeline repairs and its impact on business operations.
- When deciding whether to build or buy a data monitoring solution, consider factors such as end-to-end visibility, monitoring scope, and incident resolution capabilities.
- Operationalize data monitoring by initially focusing on broad coverage and gradually refining alerting mechanisms, ownership structures, preventive maintenance practices, and programmatic operations.
- Recognize that data pipelines are prone to breaking and data quality issues may arise unless actively maintained. Taking proactive steps to maintain data health is crucial, regardless of your next data quality initiative.