Post co-written by Donatien Tessier and Amine Kaabachi
To quote an old advertising slogan: “Power is nothing without control!”
Control for a data platform is mostly a matter of data quality management (have a read of our post on Defining Quality Data and the Various Verification Steps for more on this). Having a powerhouse of a machine that transforms and stores inconsistent data is pointless.
A data pipeline is made up of a series of steps. Typically, the final step is to make the data available to one or more users. Users can adopt a new product quickly, and it can soon become indispensable. They can also lose confidence in that same tool very quickly. Regaining this confidence can take a long time.
This is why ensuring data quality in a data platform is critical. There are three levels of quality, from the most basic to the most advanced:
- Operational health
- Dataset monitoring
- Data quality validation
  - Column-level profiling
  - Row-level validation
Data Pipeline Status
A data pipeline is a sequence of steps. The first step is to make sure that all the steps in the pipeline are performed successfully. If not, notify those responsible for the operational side of the project.
⚠️ Be careful not to go too far the other way, however. Reporting false positives will cause those receiving the alerts to ignore them, and you risk overlooking legitimate issues.
The steps can fail for various reasons, including path issues, missing libraries, etc. If you use a service like Azure Data Factory and have configured it correctly, you will be alerted when an activity fails.
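As a rough illustration of the idea above, here is a minimal sketch of a pipeline runner that stops at the first failed step and notifies whoever is on call. It is not tied to Azure Data Factory or any other orchestrator; `run_pipeline`, `notify`, and the step names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: List[Step], notify: Callable[[str], None]) -> Optional[str]:
    """Run steps in order; on the first failure, notify and stop.

    Returns the name of the failed step, or None if every step succeeded.
    """
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failure stops the run
            notify(f"Step '{name}' failed: {exc!r}")
            return name
    return None


def load_library() -> None:
    # Simulates a missing-library failure (hypothetical example).
    raise ImportError("missing library")


alerts: List[str] = []  # stands in for an e-mail or chat notification channel
failed = run_pipeline(
    [("extract", lambda: None), ("transform", load_library), ("load", lambda: None)],
    notify=alerts.append,
)
```

Only one alert is sent per failed run, which helps keep the signal-to-noise ratio of notifications high.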
In some cases, the execution will succeed even though something is wrong, for example when data is missing. Alerts should therefore be set up to detect unusual behavior and respond appropriately. This is only possible if the code is written, during development, to raise an exception in such situations. These are not strictly business rules but operating rules to be considered during the design phase. This practice falls under DataOps.
Even if processing is successful, the application may not be healthy. Processing time is an important aspect of operational health. It often degrades over a project’s life cycle. The initial dataset grows over time and can cause performance issues. This may be the case when all data is extracted daily rather than only new and changed data. This should be considered in the project design.
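To make the difference between a full and an incremental extraction concrete, here is a minimal sketch based on a "watermark" (the latest modification time already processed). The `modified_at` field and the `extract_incremental` function are assumptions made for the example; a real source may expose change tracking differently.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def extract_incremental(rows: List[Row], watermark: datetime) -> Tuple[List[Row], datetime]:
    """Return the rows modified after the watermark, plus the new watermark."""
    changed = [r for r in rows if r["modified_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["modified_at"] for r in changed), default=watermark)
    return changed, new_watermark


source = [
    {"id": 1, "modified_at": datetime(2023, 1, 1)},
    {"id": 2, "modified_at": datetime(2023, 1, 2)},
    {"id": 3, "modified_at": datetime(2023, 1, 3)},
]
# The previous run already processed everything up to 1 January.
changed, watermark = extract_incremental(source, datetime(2023, 1, 1))
```

With this approach, the amount of data processed each day depends on what changed rather than on the total size of the dataset.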
Too often, data engineers focus on datasets without considering the operational context when developing a project. What if the data is missing?
Say some of the information comes from sources A and B. What happens if data is missing from source A, from source B, or from both?
A data provider may fail to deliver files to a directory, so it is useful for the daily run to notice that the day's file is missing and escalate the problem.
What could be more damaging to a project than a user discovering after a few days that data is not being fed into a report or an application? From an operational standpoint, it’s also very time-consuming to go back through the pipeline looking for the step with the missing data and determining the cause.
Exceptions should be raised early in the processing chain if data is missing. This will reduce the need to investigate the cause of the problem and enable you to respond as quickly as possible.
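A minimal sketch of such an early check, assuming a provider drops one CSV file per day into a landing directory. The `export_YYYYMMDD.csv` naming convention, `MissingDataError`, and `require_daily_file` are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


class MissingDataError(Exception):
    """Raised when an expected input is absent or empty."""


def require_daily_file(directory: Path, day: date) -> Path:
    """Return the path of the day's file, or raise before any processing starts."""
    path = directory / f"export_{day:%Y%m%d}.csv"  # assumed naming convention
    if not path.is_file():
        raise MissingDataError(f"No file delivered for {day}: expected {path}")
    if path.stat().st_size == 0:
        raise MissingDataError(f"File for {day} is empty: {path}")
    return path


# Demonstration in a temporary directory: one day delivered, the next one missing.
with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp)
    (landing / "export_20230102.csv").write_text("id,value\n1,10\n")
    found_name = require_daily_file(landing, date(2023, 1, 2)).name
    try:
        require_daily_file(landing, date(2023, 1, 3))
        message = ""
    except MissingDataError as exc:
        message = str(exc)
```

Because the exception names the missing day and the expected path, whoever receives the alert does not have to walk back through the pipeline to find the cause.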
It’s critical to involve the user in product quality without relying on them. For example, showing the user the last successful data processing date and time can be useful. This gives them control over the information.
Data freshness is critical in a data project. Processing can run on schedule and still not deliver the latest data. As a result, it’s important to check for data gaps based on the scheduled refresh interval. For example, if data has to be uploaded every five minutes, you need to check that none of the five-minute slots is missing data.
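The five-minute example can be sketched as follows, assuming each record carries an arrival timestamp. `missing_slots` is a hypothetical helper; it only considers complete slots between `start` and `end`.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def missing_slots(
    timestamps: Iterable[datetime],
    start: datetime,
    end: datetime,
    interval: timedelta,
) -> List[datetime]:
    """Return the start of every slot in [start, end) that received no data."""
    covered = set()
    for ts in timestamps:
        if start <= ts < end:
            # Index of the slot this timestamp falls into.
            covered.add((ts - start) // interval)
    n_slots = (end - start) // interval
    return [start + i * interval for i in range(n_slots) if i not in covered]


arrivals = [
    datetime(2023, 1, 1, 12, 1),
    datetime(2023, 1, 1, 12, 7),
    datetime(2023, 1, 1, 12, 16),
]
gaps = missing_slots(
    arrivals,
    start=datetime(2023, 1, 1, 12, 0),
    end=datetime(2023, 1, 1, 12, 20),
    interval=timedelta(minutes=5),
)
# gaps contains the 12:10 slot, the only one without any arrival.
```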
In addition to data freshness, volume is another key indicator. Using historical volume statistics, it’s easy to raise an alert when a batch’s volume departs from the expected volume. You will need to use upper and lower tolerance margins to avoid false positives in this scenario.
Freshness and volume checks rely on the data being predictable. However, one component remains unpredictable: the schema. This may change during the project’s life. Changing a column type, adding a new column, or deleting an existing column are all examples of this.
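A minimal sketch of a volume check with tolerance margins, assuming the row counts of previous runs are kept somewhere. The 50% and 150% margins are arbitrary example values, and `volume_is_expected` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def volume_is_expected(
    row_count: int,
    history: Sequence[int],
    lower: float = 0.5,  # assumed margin: alert below 50% of the usual volume
    upper: float = 1.5,  # assumed margin: alert above 150% of the usual volume
) -> bool:
    """Compare a batch's row count with the average of previous runs."""
    baseline = mean(history)
    return lower * baseline <= row_count <= upper * baseline


history = [1000, 1100, 900]  # row counts of the last three runs
normal = volume_is_expected(950, history)
too_small = volume_is_expected(300, history)
too_large = volume_is_expected(2000, history)
```

Wider margins mean fewer false positives but slower detection; the right values depend on how much the volume naturally varies from one run to the next.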
To prevent type changes, the schema must be declared explicitly rather than inferred. This also improves performance, since the data no longer has to be scanned for inference.
There are two ways to manage these changes:
- accept them and allow schema evolution (https://docs.delta.io/latest/delta-update.html#-merge-schema-evolution)
- refuse them by rejecting data that does not meet your set rules
In both cases, a change alert is essential to determine whether the choice will have an impact on the solution.
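Whichever option is chosen, the change first has to be detected. Here is a minimal sketch that compares an expected schema with the one actually received, both represented as plain column-name-to-type dictionaries. This representation and `diff_schema` are assumptions for the example, not the API of any particular engine.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name


def diff_schema(expected: Schema, actual: Schema) -> Dict[str, List[str]]:
    """List the columns that were added, removed, or changed type."""
    common = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(c for c in common if expected[c] != actual[c]),
    }


expected = {"id": "int", "amount": "double", "country": "string"}
actual = {"id": "int", "amount": "string", "channel": "string"}
changes = diff_schema(expected, actual)
has_changes = any(changes.values())  # True here: send the change alert
```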
Data Quality Validation
It makes sense to check content quality once we have established the operational and overall status of the datasets in our data pipelines.
In batch processing, checking values row by row is not necessarily recommended. Sample checking is preferable. Because we typically process billions of rows, running data quality checks on the entire input would be hugely expensive.
Consider the following scenarios where we might face a data quality issue:
- A dataset column that is always empty
- A previously full column that is suddenly empty
- A change in the columns as a result of an ingestion issue
What steps do we need to take to detect and raise alerts about such issues?
The general approach is as follows:
- Select a representative random sample for each new batch.
- Calculate and store column metrics like the percentage of empty values or the most common value’s frequency for this sample.
- Compare the new metrics with those of previous batches using a distribution comparison or anomaly detection approach.
- If there are significant differences, stop the process and perform a manual validation or notify the data team urgently.
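The steps above can be sketched in plain Python for a single column. The sample size, the 0.1 tolerance, and the helper names are assumptions chosen for the example; a simple threshold stands in for a proper distribution comparison.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence


def column_metrics(values: Sequence[Optional[str]]) -> Dict[str, float]:
    """Share of empty values and frequency of the most common value."""
    n = len(values)
    if n == 0:
        return {"null_ratio": 0.0, "top_value_ratio": 0.0}
    non_null = [v for v in values if v is not None]
    top_count = Counter(non_null).most_common(1)[0][1] if non_null else 0
    return {"null_ratio": (n - len(non_null)) / n, "top_value_ratio": top_count / n}


def drifted(previous: Dict[str, float], current: Dict[str, float],
            tolerance: float = 0.1) -> List[str]:
    """Names of the metrics that moved by more than the tolerance."""
    return [name for name in previous if abs(current[name] - previous[name]) > tolerance]


rng = random.Random(42)  # fixed seed so the sampling is reproducible
yesterday = ["FR"] * 600 + ["DE"] * 400  # a previously full column
today = [None] * 1000                    # the same column, suddenly empty

previous = column_metrics(rng.sample(yesterday, 100))  # metrics stored earlier
current = column_metrics(rng.sample(today, 100))
alerts = drifted(previous, current)  # both metrics moved: stop and notify
```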
Note: For data quality and drift, the Kolmogorov-Smirnov test is one of the most widely used tests for comparing distributions. However, some teams prefer to use machine learning methods like one-class SVM or other anomaly detection methods.
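For intuition, the two-sample Kolmogorov-Smirnov statistic is simply the largest gap between the empirical cumulative distributions of the two samples. Below is a minimal pure-Python sketch of the statistic only (no p-value); in practice, `scipy.stats.ks_2samp` returns both the statistic and a p-value. The `ks_statistic` name is hypothetical, and any alert threshold you apply to it is a choice to tune.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value present in either sample.
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])    # identical samples
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])  # partially overlapping
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])    # no overlap at all
```

A value close to 0 means the two samples look alike; a value close to 1 means they barely overlap.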
The recommended technical implementation is to start with a library like AWS Deequ (in Scala) or PyDeequ, its Python counterpart, and adapt it to the needs of your company and teams. By default, this library provides the following features:
- Persistence and querying of computed data metrics
- Data profiling for large datasets
- Anomaly detection on data quality metrics over time
- Automatic suggestion of constraints for large datasets
- Incremental metrics computation on growing datasets and metric updates on partitioned data
Row-by-row validation is recommended for streaming batches and small datasets. It’s the safest and most effective way to ensure high data quality and even validate certain business rules when used on a small scale.
Unit tests or quality tests are used to apply the validation rules. Libraries make implementation and results analysis easier.
In Python, “Great Expectations,” an open source library with the following features, has become one of the most popular choices in recent years:
- Assertions and tests for data with abstraction that covers the majority of validation requirements
- Data validation suite management
- Automated documentation generation
- Automated data profiling
The library is extensible and can be used as a basis for a more comprehensive validation framework tailored to your particular needs.
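Independently of any library, the row-by-row approach can be sketched in a few lines of plain Python. This is not the Great Expectations API; the rules, field names, and `validate_rows` helper are hypothetical and only illustrate the principle of routing each row to a valid or rejected set, along with the reasons for rejection.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Rule = Tuple[str, Callable[[Row], bool]]  # (description, check)

# Example rules; real ones would come from the business and operating rules.
RULES: List[Rule] = [
    ("order_id is present", lambda r: r.get("order_id") is not None),
    ("quantity is a positive integer",
     lambda r: isinstance(r.get("quantity"), int) and r["quantity"] > 0),
    ("currency is supported", lambda r: r.get("currency") in {"EUR", "USD"}),
]


def validate_rows(
    rows: List[Row], rules: List[Rule] = RULES
) -> Tuple[List[Row], List[Tuple[Row, List[str]]]]:
    """Split rows into valid ones and rejected ones with their failed rules."""
    valid: List[Row] = []
    rejected: List[Tuple[Row, List[str]]] = []
    for row in rows:
        failures = [name for name, check in rules if not check(row)]
        if failures:
            rejected.append((row, failures))
        else:
            valid.append(row)
    return valid, rejected


batch = [
    {"order_id": 1, "quantity": 2, "currency": "EUR"},
    {"order_id": None, "quantity": 0, "currency": "GBP"},
]
valid, rejected = validate_rows(batch)
```

Keeping the failed rule names next to each rejected row makes the results easy to analyze and to report back to the data provider.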
Data Quality: The Key to a Project’s Success
Data quality is one of the most critical factors in a project’s success and in retaining users’ trust. Ensuring data quality is difficult, especially if you want to cover all three levels discussed.
Controls should ideally be planned and implemented during the project design phase. If they were not, data quality degradation is still not inevitable and can always be fixed, although the effort required depends on the design.
A thorough understanding of the business is essential here.