A common problem I see in data management circles is the confusion around what is meant by data profiling as opposed to data quality assessment.
Some people tend to use the terms interchangeably, and it's easy to see why.
When we first plug a data profiling tool into our data sources, it can generate a wealth of insight into the quality of our data. It is tempting to believe that these early investigations are, in fact, assessments of the data that we can begin to circulate.
The following definition of the verb "assess" helps us understand where many people go wrong:
assess: to determine the value, significance, or extent of
A lot of people use data profiling as the start and end point of their data quality assessment, and as a result they lack the ability to determine whether the profiling results are:
- Valued in a balanced and correct way
- Significant to the business
- Reflective of the true extent of a particular issue
The problem is that we're actually missing a few key stages, so let's extend our discussion with a more comprehensive workflow.
Step 1: Data Profiling (Data Quality Requirements Discovery)
Here we are using our data profiling software to begin the process of discovery, not assessment. We’re looking to find rules and requirements that will help us to perform a more thorough data quality assessment in a later step.
For example, data profiling can help us discover value frequencies, formats and patterns that lead us to believe a particular attribute is a product code. Using data profiling alone we can find some perceived defects and outliers, but in terms of assessing the quality of the code, profiling will fall short until we have created more rigorous definitions of quality.
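As a minimal sketch of this discovery step, the frequency and pattern analysis described above can be expressed with nothing more than the Python standard library. The product codes and the "A for letter, 9 for digit" pattern convention are illustrative assumptions, not output from any particular profiling tool:

```python
import re
from collections import Counter

def profile_column(values):
    """Profile a column: value frequencies, generalised format patterns,
    and length distribution. This is discovery, not assessment."""
    def pattern(value):
        # Generalise each value: every letter becomes A, every digit 9
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))
    return {
        "frequencies": Counter(values),
        "patterns": Counter(pattern(v) for v in values),
        "lengths": Counter(len(v) for v in values),
    }

# Hypothetical product codes pulled from a source system
codes = ["AB-1234", "AB-1235", "XY-0001", "ab1234", "AB-1234"]
report = profile_column(codes)
print(report["patterns"])
# The dominant pattern hints at a candidate rule such as "AA-9999",
# but that is a lead to investigate, not yet an approved quality rule.
```

A dominant pattern like this is exactly the kind of finding that feeds the requirements work in the next step; on its own it tells us nothing about whether the outliers are defects or legitimate exceptions.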
There are other important considerations such as:
- Does the code have a viable business function or is it now redundant?
- Is the quality of the code determined by other attributes, for example manufacturer code or some other combination of attribute values?
- Can we decompose the code to extract more information that will help us validate the quality of its value?
So, with our very first profiling activity we've actually started a process of data quality requirements gathering, not assessment; that will come later.
Step 2: Data Quality Requirements Creation
Armed with our data profiling insights, we can now start to define data quality rules that our data must adhere to. Why must we do this? Simple: we need a means of comparing the quality of our data against an approved set of criteria. Data profiling results alone simply publish findings; there is no approval rating or contextual validation at all.
For example, on a previous assignment we discovered major issues with location information across a wide range of inside plant equipment for a utilities organisation. According to the profiling results the figure was bleak: 40% of equipment had a missing location value.
However, this profiling figure gave us no means of true data quality assessment because:
- A huge proportion of that equipment was actually retired or assigned to spares
- A great deal of equipment belonged to other partners and was therefore out of scope
- Some equipment was actually mastered in another system, so depending on the equipment type it was important to gather location data from another source
So, as we can see, the data profiling function can help us uncover these rules and requirements but profiling cannot give us an accurate assessment until we have defined and built the rules somewhere.
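Rules like those in the utilities example might be captured as explicit scope filters that run before any missing-location figure is calculated. This is a hypothetical sketch: the field names, status values and ownership labels are illustrative assumptions, not drawn from a real system:

```python
# Hypothetical equipment records; field names and values are
# illustrative assumptions, not from a real utilities system.
equipment = [
    {"id": 1, "status": "active",  "owner": "us",      "master": "local", "location": None},
    {"id": 2, "status": "retired", "owner": "us",      "master": "local", "location": None},
    {"id": 3, "status": "active",  "owner": "partner", "master": "local", "location": None},
    {"id": 4, "status": "active",  "owner": "us",      "master": "gis",   "location": None},
    {"id": 5, "status": "active",  "owner": "us",      "master": "local", "location": "SUB-7"},
]

def in_scope(item):
    # Retired or spare equipment, partner-owned assets, and records
    # mastered in another system fall outside this rule's scope.
    return (item["status"] not in ("retired", "spare")
            and item["owner"] == "us"
            and item["master"] == "local")

scoped = [e for e in equipment if in_scope(e)]
missing = [e for e in scoped if not e["location"]]
print(f"{len(missing)}/{len(scoped)} in-scope records missing a location")
```

In this toy data set, the raw profiling view would report four of five records missing a location, while the scoped assessment reports one of two, which is the difference between publishing a finding and assessing quality against agreed criteria.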
Step 3: Data Quality Assessment
Okay, we've profiled our data and discovered a wide-ranging set of data quality requirements, or rules, and now we need to put those rules to the test.
We assess the data against our rules base and record the passes and fails to create a true assessment of data quality.
(Taking a purist stance, the only way to make a true assessment of data quality is to validate against the real-world source of the data, but this is impractical in most cases.)
So, in our earlier example, we would assess the location of our equipment based on a far more stringent set of rules than profiling alone would give us. We may use profiling functions to validate the function, length, code values and substring values against our data quality requirements, but the goal is to determine whether each value passes or fails against an approved set of criteria.
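The pass/fail mechanics of this step can be sketched as a small rules base of named predicates, with a tally of passes and fails per rule. The rule names and the sample code format are assumptions for illustration only:

```python
import re

# A hypothetical rules base: each rule is a name plus a predicate.
rules = {
    "not_blank": lambda code: bool(code and code.strip()),
    "length_is_7": lambda code: len(code or "") == 7,
    "format_AA-9999": lambda code: bool(re.fullmatch(r"[A-Z]{2}-\d{4}", code or "")),
}

def assess(values, rules):
    """Run every rule against every value and tally passes and fails."""
    results = {name: {"pass": 0, "fail": 0} for name in rules}
    for value in values:
        for name, predicate in rules.items():
            results[name]["pass" if predicate(value) else "fail"] += 1
    return results

codes = ["AB-1234", "ab1234", "", "XY-0001"]
summary = assess(codes, rules)
for name, tally in summary.items():
    print(name, tally)
```

The key difference from raw profiling output is that every fail here is a breach of an approved criterion, so the tallies can legitimately be reported as a quality score rather than as an unweighted list of findings.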
Using this approach we can build a much clearer picture of data quality "health". Many companies instantly panic when they first run data profiling software and it highlights a vast number of defects. However, if they understand the bigger picture and move through the profiling, requirements gathering and data quality assessment phases, they get a far more balanced and objective view of how bad or good their data really is.
What do you think? Do you feel data profiling and data quality assessment are the same or different? I welcome your views on this topic.