Data Observability and Proactive Data Testing: An Analytics Engineer’s Answer to Complexity

Getting ahead of data quality complexities with data observability and proactive data testing.

Last Updated: November 30, 2022

In a typical modern data stack, the variety of data sources landing in the data warehouse has exploded. Gleb Mezhanskiy, CEO of Datafold, suggests shifting the focus from monitoring live data for anomalies to prevention, and explains how to drastically reduce the number of data quality issues.

Life as an analytics engineer has become more and more complex as the data warehouse becomes the center of gravity for many sources and uses of data. Initially, the job of an analytics engineer was to take a few sources and model them for internal reporting. Today, they need to wrangle data from a variety of different sources and model that data so that it’s ready for consumption by both people and algorithms.

Consider how a typical company’s data stack grows in complexity. In the early stages, most organizations are dealing with relatively few data sources: the database that runs their application, one CRM tool, and perhaps financial data coming from Stripe. Combining this data into useful tables for analysis and internal reporting is not too complex.

As a company grows, they typically expand the number of data sources they rely upon, which makes modeling the data more complex. They now have to combine information from additional customer support tools, multiple web tracking tools, productivity tools, legacy and new tools, and many other small workflow tools to get an accurate picture of their business. The work is hard but still largely internal-facing.

Once companies have this variety of data modeled in the data warehouse, there is an opportunity to leverage it beyond just internal reporting. They can use it for data science use cases such as predictive analysis, algorithms to deploy ad spend, or algorithms to surface products that one of their customers is interested in buying. Companies are now also using tools such as Hightouch to push warehouse data back into operational tools such as Salesforce and Intercom. This allows marketers to use data in the warehouse for email marketing segmentation and lead scoring much more easily.

Now that data in the warehouse is being used for customer-facing use cases, we need to care significantly more about quality. Broken tables, incorrect values, and out-of-date information can mean sending emails to the wrong customers, recommending the wrong products, or overpaying vendors. Any time we update a table in a pipeline, we may accidentally be impacting a financially consequential use case. When a data stack is this complex and powers this many use cases, we need to prevent breaks before they reach production data.

The other factor that makes updating tables within a complex set of pipelines more challenging is a growing data team size. This often means multiple people are updating tables each week. It can be challenging to know what the current state of the data model is and what will happen downstream when you make updates.

Analytics and data engineers need tools to provide visibility into how all of this data is connected and what happens to the underlying data and downstream applications if a change is made.

Getting Ahead of Data Quality Problems

Unfortunately, without tooling, it is not feasible for engineers to thoroughly test all of the impacts their pull request could have because there are too many tables and situations that could potentially go wrong. Many engineers write unit tests for common situations on high-value columns, but that is not sufficient for how complex pipelines have become. Engineers also have their peers review their pull requests, but these reviews are limited in their ability to catch impacts on downstream data. Therefore, many engineers go ahead and merge to production, then perform spot checks in key BI reports to see if important metrics were affected.
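To make the limits of this approach concrete, here is a minimal sketch of the kind of column-level unit tests engineers typically write (mirroring dbt's not-null and accepted-values checks). The table and column names are hypothetical:

```python
def check_not_null(rows, column):
    """Return indices of rows with a NULL (None) in the given column."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_accepted_values(rows, column, allowed):
    """Return indices of rows whose value falls outside the allowed set."""
    return [i for i, row in enumerate(rows) if row.get(column) not in allowed]

# Hypothetical orders table with one bad record
orders = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": None},
    {"order_id": 3, "status": "refunded"},
]

null_rows = check_not_null(orders, "status")
bad_rows = check_accepted_values(orders, "status", {"placed", "shipped", "refunded"})
print(null_rows)  # row index 1 has a NULL status
print(bad_rows)   # row index 1 also fails the accepted-values check
```

Tests like these catch known failure modes on columns someone thought to test, but they say nothing about the many untested tables and downstream dependencies a pull request can silently break.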

Beyond testing, review and spot-checking, some engineers implement a monitoring tool on their production data. This functions similarly to checking BI reports but can provide a wider lens into what the live data looks like. Monitoring tools can alert engineers via Slack whenever data looks to be different than it was in the past. Getting these alerts is better than getting an email from the CEO whose KPI dashboard just stopped working properly — but, unfortunately, these issues are in production, which is too late. The other downside to this approach is that data changes often for legitimate reasons, so alerts may be sent for non-critical reasons as well, making it hard to triage.
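The core of such a monitoring tool can be sketched as a simple statistical check: compare today's value of a metric, such as a table's row count, against its recent history and alert when it deviates too far. The threshold and metric here are illustrative assumptions, not any particular vendor's algorithm:

```python
import statistics

def is_anomalous(history, today, threshold=3.0):
    """Flag today's metric if it deviates more than `threshold`
    standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

# Hypothetical daily row counts for a production table
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_000]

print(is_anomalous(daily_row_counts, 10_095))  # within normal range -> False
print(is_anomalous(daily_row_counts, 4_300))   # sudden drop -> True
```

Note that this check runs against live production data by design, which is exactly the limitation described above: by the time the alert fires, consumers may already have seen the bad data.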

Proactive Testing: Shifting the Focus to Prevention

So what is the answer? How can data engineers proactively manage complexity and prevent data quality problems from getting into production?

They need tooling with a column-level lineage so that they can have clear visibility into how all of their organization’s data is connected. That, in turn, enables them to make better engineering decisions by letting them see the consequences of a pull request before they merge code to production.
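Under the hood, column-level lineage is a dependency graph, and "what does my change affect?" is a graph traversal. The sketch below uses a hypothetical lineage map to find every column downstream of a changed source column; real lineage tools derive this graph automatically from SQL:

```python
from collections import deque

# Hypothetical column-level lineage: each key is a column, mapped to
# the downstream columns that are derived from it.
lineage = {
    "stripe.charges.amount": ["dwh.orders.revenue"],
    "dwh.orders.revenue": ["dwh.customer_ltv.ltv", "bi.exec_dashboard.total_revenue"],
    "dwh.customer_ltv.ltv": ["salesforce.account.lead_score"],
}

def downstream_of(column):
    """Breadth-first walk of the lineage graph, collecting every
    column affected by a change to `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream_of("stripe.charges.amount"))
```

A change to a single Stripe source column surfaces not just the warehouse tables built on it, but also the executive dashboard and the lead score synced back to Salesforce, which is exactly the visibility an engineer needs before merging.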

Imagine you’re merging customer and prospect data from Salesforce with web analytics, e-commerce data, and Zendesk. Your goal is to consolidate that information in Snowflake, then run BI reports in Tableau to better understand which prospects convert to customers. You’re using Fivetran to extract and load the data and then use dbt to transform your data within Snowflake. You also want to push quarterly sales totals back to Salesforce using a tool like Hightouch, which can help you segment your customer base more effectively.

Now it comes time to make a schema change to your e-commerce platform. What will that do to the system as a whole? What are the unintended consequences? Will it do damage to your customer segmentation in Salesforce? Will it break an executive dashboard that your C-suite users look at daily?

The ideal time to address data quality issues for code changes is before the change merges. Whether finding issues by exploring column-level lineage or integrating automated testing into CI, you should know how your change will cascade throughout your data pipelines and tools. Proactive data testing is about automating the analysis of a change's impact on all systems so that the data engineer can avoid breaking anything upstream or downstream of that change.
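One concrete form of automated testing in CI is a data diff: build the changed model in a staging environment, then compare it row by row against the production version before merging. A minimal sketch, keyed on a hypothetical primary key:

```python
def data_diff(prod_rows, dev_rows, key="id"):
    """Compare two versions of a table keyed by primary key and
    report added, removed, and changed rows."""
    prod = {r[key]: r for r in prod_rows}
    dev = {r[key]: r for r in dev_rows}
    return {
        "added": sorted(dev.keys() - prod.keys()),
        "removed": sorted(prod.keys() - dev.keys()),
        "changed": sorted(k for k in prod.keys() & dev.keys() if prod[k] != dev[k]),
    }

# Hypothetical before/after versions of an orders model
prod = [{"id": 1, "total": 100}, {"id": 2, "total": 250}]
dev = [{"id": 1, "total": 100}, {"id": 2, "total": 275}, {"id": 3, "total": 40}]

print(data_diff(prod, dev))  # {'added': [3], 'removed': [], 'changed': [2]}
```

Surfacing this diff in the pull request lets the author and reviewer see exactly which rows a code change alters, instead of discovering the difference in a dashboard after the merge.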

None of this is intended to say that proactive testing is a panacea for data quality. It is not. There are plenty of other potential reasons why data can go awry. Externally sourced data may change unexpectedly, infrastructure bugs may lead to corrupted information, or missing records may be dropped during the replication process. Some information may simply be wrong from the outset, and the accuracy of some data will naturally decay over the course of time. Customer databases are infamous in that respect.

All of these issues call for their own specialized tools and approaches, but for data quality issues that arise from the kind of complexity and scale the modern data stack is intended to address, proactive data testing provides a critical way to prevent problems before they happen.


Gleb Mezhanskiy
Gleb Mezhanskiy is founding CEO of Datafold, a data reliability platform that helps data teams deliver reliable data products faster. He has led data science and product areas at companies of all stages. As a founding member of data teams at Lyft and Autodesk and head of product at Phantom Auto, Gleb built some of the world’s largest and most sophisticated data platforms, including essential tools for data discovery, ETL development, forecasting, and anomaly detection.