Farid Gurbanov, Cloud Architect

Automated Data Quality Management: Treating Data Like Code

Leveraging Great Expectations for Data-driven Success

Sep 17, 2023

You have plenty of data and are eager to try a new ML model to boost your sales, integrate with OpenAI, or hire a data scientist to extract value from your data. If you don't monitor Data Quality (DQ) metrics within your pipelines, chances are that the data you have is incomplete, inconsistent or outdated.

Managing the quality of data has become a growing challenge as the velocity and variety of data increase. To build an effective ML model, or to draw any insight from the data, you need a sensible level of DQ.

Historically, with an RDBMS at the core of the corporate data warehouse (DWH), DQ was usually enforced automatically through built-in controls for value type conformity, referential integrity and uniqueness. Inconsistent data would raise an error for support engineers to investigate. In terms of the well-known CAP theorem, the business required Consistency and Partition Tolerance (integrity) over Availability.
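To make those built-in controls concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `customers` and `orders` tables and their columns are hypothetical, chosen only for illustration, and SQLite stands in for whatever RDBMS sits behind the warehouse. Each bad row is rejected with an error rather than silently stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE                       -- uniqueness
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(id),            -- referential integrity
    amount      REAL NOT NULL
                CHECK (typeof(amount) IN ('real', 'integer'))  -- type conformity
);
""")

def try_insert(sql, params):
    """Run one INSERT and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# Hypothetical rows: one valid, three violating a different control each.
attempts = [
    ("INSERT INTO customers (id, email) VALUES (?, ?)", (1, "a@example.com")),
    ("INSERT INTO customers (id, email) VALUES (?, ?)", (2, "a@example.com")),          # duplicate email
    ("INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)", (1, 99, 10.0)),   # unknown customer
    ("INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)", (2, 1, "ten")),   # wrong type
]
outcomes = [try_insert(sql, params) for sql, params in attempts]
for outcome in outcomes:
    print(outcome)
```

Only the first insert succeeds; the other three are refused by the database itself, which is the "error for support engineers to investigate" described above.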

As the demand for processing vast volumes of both structured and unstructured data within increasingly tight timeframes escalated, the concepts of Data Lake and, subsequently, Data Lakehouse began to surface. These innovations were supported by technologies that prioritise Availability at the expense of either Consistency or Partition Tolerance. The quality of the data is no longer guaranteed.

Hence, before you can try a new ML model and measure its effectiveness, you have to do the tedious work of validating DQ. Here are six key quality dimensions to consider:

Accuracy (sense check)
Consistency (between two data sets)
Conformance (to the expected data type)
Completeness (all critical fields in a record are fully populated)
Timeliness (the age of the data)
Uniqueness (no duplicate records)

For a more comprehensive reading on DQ dimensions and metrics please refer to the Dimensions of Data Quality from DAMA NL Foundation.
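As a rough illustration of the six dimensions listed above, the sketch below applies one simple check per dimension to a handful of records. The sample data, the field names, the 30-day freshness threshold and the `check_orders` helper are all assumptions made for this example. Real rules depend on the data set and the business context.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample data: orders and the set of customer ids they may refer to.
NOW = datetime(2023, 9, 17, tzinfo=timezone.utc)
customers = {1, 2}
orders = [
    {"id": 1, "customer_id": 1, "amount": 25.0, "updated": NOW - timedelta(hours=2)},
    {"id": 2, "customer_id": 3, "amount": -5.0, "updated": NOW - timedelta(days=40)},
    {"id": 2, "customer_id": 2, "amount": "12", "updated": NOW - timedelta(hours=1)},
    {"id": 4, "customer_id": None, "amount": 8.5, "updated": NOW - timedelta(hours=3)},
]

DIMENSIONS = ("accuracy", "consistency", "conformance",
              "completeness", "timeliness", "uniqueness")

def check_orders(rows, known_customers, now, max_age=timedelta(days=30)):
    """Return {dimension: [indexes of rows that fail that dimension]}."""
    failures = {d: [] for d in DIMENSIONS}
    seen_ids = set()
    for i, row in enumerate(rows):
        amount = row["amount"]
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number:
            failures["conformance"].append(i)      # wrong data type
        elif amount < 0:
            failures["accuracy"].append(i)         # sense check: no negative amounts
        if any(row[f] is None for f in ("id", "customer_id", "amount")):
            failures["completeness"].append(i)     # a critical field is missing
        elif row["customer_id"] not in known_customers:
            failures["consistency"].append(i)      # disagrees with the customer data set
        if now - row["updated"] > max_age:
            failures["timeliness"].append(i)       # record is too old
        if row["id"] in seen_ids:
            failures["uniqueness"].append(i)       # duplicate id
        seen_ids.add(row["id"])
    return failures

report = check_orders(orders, customers, NOW)
for dimension, bad_rows in report.items():
    print(f"{dimension}: {bad_rows}")
```

Reporting failures per dimension, rather than a single pass/fail flag, makes it easier to turn the results into DQ metrics that can be tracked over time.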

Data Quality (DQ) challenges have become increasingly prominent over the past decade. During this time, I've observed numerous industry solutions emerge, offering DQ functionality either as part of a comprehensive (albeit expensive) package or as a standalone external tool. This approach is predominantly geared towards larger enterprises that have undergone extensive Impact Assessment (IA) and recognize its value.

However, for small and medium-sized enterprises (SMEs) with limited profit margins, these tools are often considered luxuries. Furthermore, their use cases typically involve experimenting with basic IA, often conducted through email exchanges, with a focus on rapid experimentation and a willingness to fail fast. In such a scenario, DQ can become a burden that Data Scientists and Data Engineers must grapple with.

The primary challenge lies in automating, version controlling, monitoring, and documenting DQ in a sustainable manner. Essentially, the task is to manage DQ as code and seamlessly integrate it into the Software Development Life Cycle (SDLC).
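One way to picture "DQ as code" is a small rule file kept in the repository and a runner that a CI job can call. The sketch below is a hand-rolled illustration under that assumption, not the API of any particular tool. The rule format, the check names and the `run_rules` and `exit_code` helpers are hypothetical.

```python
import json

# Hypothetical rule file content. In practice it would live in version control
# next to the pipeline code and be reviewed like any other change.
RULES_JSON = """
[
  {"column": "email",  "check": "not_null"},
  {"column": "id",     "check": "unique"},
  {"column": "amount", "check": "between", "min": 0, "max": 10000}
]
"""

def not_null(values, rule):
    return all(v is not None for v in values)

def unique(values, rule):
    return len(values) == len(set(values))

def between(values, rule):
    return all(v is not None and rule["min"] <= v <= rule["max"] for v in values)

# Registry mapping the check names used in the rule file to functions.
CHECKS = {"not_null": not_null, "unique": unique, "between": between}

def run_rules(rows, rules):
    """Evaluate every rule against a list of dict rows."""
    results = []
    for rule in rules:
        values = [row.get(rule["column"]) for row in rows]
        passed = CHECKS[rule["check"]](values, rule)
        results.append({"rule": rule, "passed": passed})
    return results

def exit_code(results):
    """0 if every rule passed, 1 otherwise, so a CI job can fail the build."""
    return 0 if all(r["passed"] for r in results) else 1

rules = json.loads(RULES_JSON)
batch = [
    {"id": 1, "email": "a@example.com", "amount": 20},
    {"id": 2, "email": None, "amount": 35},
]
results = run_rules(batch, rules)
for r in results:
    status = "PASS" if r["passed"] else "FAIL"
    print(status, r["rule"]["check"], r["rule"]["column"])
```

Because the rules are plain data, a change to a threshold shows up as a diff in a pull request, and the printed results can double as lightweight documentation of what was checked.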

I've encountered these challenges in all my previous projects and have drafted a configuration-based DQ solution to address them. Fortunately, there's a more comprehensive solution available that has recently garnered significant attention: Great Expectations. This open-source platform made its debut at the Strata Data & AI Conference in 2018 and has since grown rapidly because it fills a critical gap in the field. In October 2022, it reached the top ring of the Thoughtworks Technology Radar, "Adopt."

Excited about getting value out of data? Ensure good DQ first. Inconsistent or incomplete data can hinder ML success. Validate DQ for accuracy, consistency, conformance, completeness, timeliness, and uniqueness. Smaller businesses face DQ challenges, but solutions like Great Expectations can help.

Would you like to leave comments or share your likes? You can do so on the version of this article published on LinkedIn.
