Measuring Data Quality for Ongoing Improvement: Interview with Laura Sebastian-Coleman
Perhaps the most critical aspect of any data quality initiative is the ability to perform effective data quality measurements, so that an accurate assessment can be made and a roadmap for improvement developed.
To help readers gain practical insight into the methods and pitfalls of data quality assessment and measurement, I asked data quality expert Laura Sebastian-Coleman to share her perspective.
In this interview, Laura draws on numerous techniques and methods from her recent book published by Morgan Kaufmann: Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework.
The book is also available from Amazon.
Dylan Jones: What are some of the common mistakes you see people making when they first start out with data quality assessment activities?
Laura Sebastian-Coleman: The biggest mistake people make is failing to articulate their goals in conducting the data quality assessment and not planning the process to meet these goals. Without clear goals, it is hard to execute an effective assessment.
From my experience, two obstacles often prevent people from establishing goals related to the data:
A focus on tools rather than the data and the assessment process
Incorrect assumptions about the nature of and solution to data problems
When people from IT approach data quality assessment, they often start with the question of what tool to use. Data assessment then becomes a question of tool selection, rather than tool use.
It is easy to get caught up in the idea that a data profiling or cleansing tool will solve data problems. It won’t, any more than a hammer, by itself, can build a porch. The choice of tools should be driven by the goals of the assessment.
In contrast, business people often get caught up in the drama of data issues. For some people there is a governing narrative about how to improve data quality. It goes like this: People recognize that their data is not right. They have ways of explaining some of its condition – data entry errors, complexity within source systems, indifference upstream to downstream uses of the data. But, they think their problems cannot be as mundane as all that. Data assessment becomes a quest to find a single criminal – The Root Cause – rather than to understand the process that creates the data and the factors that contribute to data issues and discrepancies. They start hunting for the one thing that will explain all the problems. Their goal is to slay the root cause and live happily ever after.
Their intentions are good. And slaying root causes – such as poor process design – can bring about improvement. But many data “problems” are symptoms of a lack of knowledge about the data and the processes that create it. You cannot slay a lack of knowledge. The only way to solve a knowledge problem is to build knowledge.
So, in my thinking, one goal of data assessment should always be to build knowledge of the data. Doing so requires being skeptical about what information you have about the data and freeing your mind of assumptions (that is, what you think you know about the data) in order to concentrate on what you actually see in the data. And trying to be as objective as possible about what you are not seeing.
I like how both Jack Olson (in Data Quality: The Accuracy Dimension) and Arkady Maydanchik (in Data Quality Assessment) talk about the relation between metadata (explicit knowledge of data) and data quality assessment. You need a starting point for assessment, but you also need to use the assessment process itself to improve your metadata.
David Loshin (in Enterprise Knowledge Management: The Data Quality Approach) addresses the question of data assessment and knowledge from a slightly different angle, pointing out that data itself embeds knowledge and assessment can enable re-discovery – so an organization can re-learn what it has forgotten it should know.
The New Oxford American Dictionary defines assessment as “the process of evaluating or estimating the nature, ability, or quality of a thing.” ASQ defines it as “a systematic evaluation process of collecting and analyzing data to determine the current, historical or projected compliance of an organization to a standard.” As a synonym for measurement, assessment implies the need to compare one thing to another in order to understand it. But assessment also implies drawing a conclusion about – evaluating – the object of the assessment, whereas measurement does not always carry that implication.
So another goal of any data quality assessment is to evaluate data against a set of expectations and draw conclusions about the data’s suitability for different uses.
Data quality assessments should produce a deliverable that documents the degree to which data meets expectations and the ways in which it does not meet them.
Expectations can be formalized in a number of ways: as standards, requirements, or metadata. Even knowing that you do not know a lot about the data can be a starting point. Just diving into the data without a plan or without a target deliverable does not always produce usable results – and it can take a lot of time, probably more time than if you had a plan. As is true of other projects, so too with data assessment. Having goals, a plan, and a target deliverable allows you to know when you have brought the analysis to a useful level of completion.
Dylan Jones: What steps do you typically incorporate into a data quality assessment framework?
Laura Sebastian-Coleman: The DQAF (Data Quality Assessment Framework), which is the subject of my book, describes three large categories of assessment:
Initial Assessment, aimed at gaining an understanding of the data to be assessed and the data environment in which the assessment will take place;
Process Controls and In-line Measurement, aimed at managing data within a data store; and
Periodic Measurement, also aimed at managing data within or between data stores.
While these different types of assessment have different goals, approaching them requires a similar set of steps:
Establish the goals of the assessment
Define the approach and deliverables, including what data will be in scope and out of scope for the process
Document the expectations against which data will be evaluated
Conduct the analysis itself (by making observations about the data and comparing the data to expectations)
Evaluate the results of the analysis
Share the results with stakeholders
Prioritize follow up actions, using the results to improve the quality of data
Putting them in bullets makes the process seem easy, but assessment is not always easy, so here are some more practical ideas for your members:
Establish the goals of the data quality assessment – As I noted under the earlier question, sometimes groups do not establish specific goals for data quality assessment. Without these, it is hard to structure the work that follows. Goals can be refined throughout the assessment process. Early in a project, assessment may be needed simply to determine whether data content will meet the needs of an application’s data consumers. Later in the project, assessment may be needed to determine whether specific rules are working as expected or fields are populated as required. Assessment should also look for risks, for example, identifying whether there are obstacles to integrating data from a particular source into a data store. One goal should always be to build explicit knowledge of the data through clear, usable documentation.
Define the approach and deliverables, including what data will be in scope. One of the dangers of assessment projects is analysis paralysis. It is important to limit scope and to be clear about what you will deliver in the end. For example, a column-by-column set of observations, summarized observations, recommendations for improvement, etc. Since assessment should always improve understanding of the data, the plan should include a means of systematically capturing results of analysis.
Document the expectations against which data will be evaluated. People struggle with the concept of expectations for data. Some data seems obvious: a name is a name is a name. Or people say, “I can’t tell whether the data is good; only the data consumer can say if the data is of high quality.” With all due respect to data consumers, I think most knowledgeable analysts can also define expectations related to common sense, the representational or semantic aspects of data, the storage characteristics, and the rules implied by business processes or data models. (For example, we can surmise that records with a first name of “DO NOT USE” may indicate a problem, which we can begin to explore by asking staff from the source system why that value is present in the records.) At the very least, as a starting point for expectations, an analyst should know the intention of the business processes that produce the data. And of course, in some organizations, expectations are documented in various forms. Culling through documentation to understand both business rules and data rules provides a solid starting point for expectations.
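To show how an expectation like this can be turned into a repeatable check, here is a minimal Python sketch. The field name, the list of placeholder values, and the sample records are hypothetical and only illustrate the idea; neither the book nor the interview prescribes a particular implementation.

```python
from collections import Counter

# Hypothetical placeholder values that suggest a data-entry workaround
# rather than a real first name.
PLACEHOLDER_FIRST_NAMES = {"DO NOT USE", "UNKNOWN", "TEST", ""}

def check_first_name_expectation(records):
    """Count records whose first_name looks like a placeholder.

    records: iterable of dicts with a 'first_name' key (illustrative schema).
    Returns a Counter of suspect values so an analyst can take the list
    back to the source system and ask why those values are present.
    """
    suspects = Counter()
    for rec in records:
        value = (rec.get("first_name") or "").strip().upper()
        if value in PLACEHOLDER_FIRST_NAMES:
            suspects[value or "<blank>"] += 1
    return suspects

# Example usage with made-up records
sample = [
    {"first_name": "Maria"},
    {"first_name": "DO NOT USE"},
    {"first_name": ""},
]
print(check_first_name_expectation(sample))
# Counter({'DO NOT USE': 1, '<blank>': 1})
```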
Conduct the data quality analysis itself. The analysis consists of comparing actual data to the expectations for it. Analysis will need to take place at several levels – column, structure, rule, cross-file, cross-system. At Optum, we have defined a protocol for analysis, starting with an overall assessment of the data set, then diving into column details, and moving outward to structural aspects of the data. The protocol includes conditions to look at (cardinality, level of population, adherence to documented rules, adherence to implied rules) and describes why these are important and what they might imply about the data. The basic drivers within this protocol are described in Chapter Seven of Measuring Data Quality for Ongoing Improvement.
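As an illustration of the column-level conditions named above (cardinality and level of population), here is a small Python sketch. It is not the Optum protocol itself; the column and the sample data are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, level of population, cardinality,
    and most frequent values (None or empty string counts as not populated)."""
    total = len(values)
    populated = [v for v in values if v not in (None, "")]
    freq = Counter(populated)
    return {
        "row_count": total,
        "populated_pct": round(100.0 * len(populated) / total, 2) if total else 0.0,
        "cardinality": len(freq),
        "most_common": freq.most_common(5),
    }

# Example: a state-code column that is expected to always be populated
# with valid two-letter codes (data is made up).
print(profile_column(["NY", "CA", "NY", "", None, "TX"]))
# {'row_count': 6, 'populated_pct': 66.67, 'cardinality': 3,
#  'most_common': [('NY', 2), ('CA', 1), ('TX', 1)]}
```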
Evaluate the results of the data analysis. Evaluating the results of analysis requires stepping back and synthesizing what you have learned. This process should tie back to the goals of the assessment. You should be able to describe which characteristics of the data conform to expectations and which do not. You should also be able to quantify the degree of difference from expectation – which means you need to determine how to measure the data in relation to the expectations. If one of the goals of the assessment is to make a decision about whether to use the data, evaluation should enable you to answer that question or to identify the additional steps needed to answer it.
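One simple way to quantify the degree of difference from expectation is to express conformance as a percentage and compare it to a threshold agreed with data consumers. The rule, threshold, and data in this Python sketch are hypothetical and only illustrate the arithmetic.

```python
def conformance_rate(values, rule):
    """Percentage of values that satisfy a rule (a predicate function)."""
    total = len(values)
    passing = sum(1 for v in values if rule(v))
    return 100.0 * passing / total if total else 0.0

# Hypothetical expectation: discharge dates must not precede admission dates.
stays = [
    ("2023-01-02", "2023-01-05"),
    ("2023-02-10", "2023-02-01"),  # violates the expectation
    ("2023-03-01", "2023-03-03"),
]
rate = conformance_rate(stays, lambda s: s[1] >= s[0])
threshold = 99.0  # illustrative value agreed with data consumers
print(f"Conformance: {rate:.1f}% (expected at least {threshold}%)")
```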
Share the results with stakeholders. Stakeholders will come to the process with a set of assumptions about the data. Their questions may require that you extend the scope of the assessment by including additional data or that you dive more deeply into the data you have already reviewed. Based on feedback from stakeholders, you may need to repeat the basic steps of the assessment: establish the goals, define the approach, deliverables, and scope.

Use the results to improve the quality of data. Ultimately the goal of data assessment is data improvement. In some cases, improvement comes as a result of the knowledge gained through an assessment. For example, if you discover a rule that people had forgotten, or if you can see a relationship between record types that explains a perceived inconsistency in the data, then the assessment itself allows you to make better use of the data. But if you find a systematic problem that requires technical changes, then the work to fix that problem will need to be prioritized against other issues in the organization.
Dylan Jones: Where do you stand on the data quality software debate, do you feel specialist tools are a critical element of succeeding with a data quality assessment?
Laura Sebastian-Coleman: There are a number of really excellent profiling tools on the market. These tools are designed to surface characteristics of data that can be very helpful for anyone who wants to gain basic understanding or do a deep dive into the data. They can also show you characteristics that you might not be looking for (for example, most have built-in capabilities for detecting patterns in data; these capabilities are more systematic at pattern detection than people are), which means they can contribute to the process of gaining knowledge about data in a way that profiling “by hand” cannot.
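For readers who have not used a profiling tool, the kind of built-in pattern detection described here can be approximated in a few lines of Python: reduce each value to a character-class pattern and count the patterns. This is only an illustration of the idea, not a description of how any particular tool works.

```python
import re
from collections import Counter

def value_pattern(value):
    """Reduce a value to a character-class pattern, e.g. '860-12-3456' -> '999-99-9999'."""
    pattern = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"[0-9]", "9", pattern)

def pattern_profile(values):
    """Count the distinct patterns in a column; rare patterns often signal problems."""
    return Counter(value_pattern(v) for v in values)

# Example: most phone numbers follow one pattern; the outlier stands out immediately.
print(pattern_profile(["555-867-5309", "555-123-4567", "unknown"]))
# Counter({'999-999-9999': 2, 'AAAAAAA': 1})
```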
That said, the tools do not solve problems on their own, and even though they are marketed as if they can do “analysis”, they don’t do the real work of assessment – drawing conclusions.
As I noted earlier, if the process of tool selection is put before tool use, then organizations position themselves for failure. Tool selection should be based on establishing requirements for tool use. Requirements should be based on the goals an organization has for its data or on the needs of the data consumers for specific projects related to data.
An organization that has a clear plan can get a lot out of a tool. An organization that does not have a clear plan may end up blaming a tool for its lack of success in understanding its own data. And profiling should be about gaining an understanding of data – assessing it.
If there is lack of focus on a set of deliverables that capture the understanding people have about the data, then the results of the real work of profiling will dissolve into the ether.
Most tools marketed as data quality tools focus on profiling and are intended to be used as part of the data discovery process. Their focus is initial data assessment, and the marketing around such tools often starts with the assumption that the people using them do not know much at all about their own data. There are some others marketed as “quality” tools that are really cleansing tools. To use these effectively requires deep knowledge of the data that allows for the establishment of cleansing rules. In most cases, the effort of establishing these rules would be better spent on improving the ways data is produced in the first place, rather than on the ways it can be “cleaned”.
What I have not seen are tools that effectively measure the quality of data as it moves along the data chain. The DQAF describes data assessment in relation to initial discovery, in-line measurement, and periodic re-assessment, but its focus is largely on in-line measurement. I have not seen any tools that do what the DQAF asserts should be done with respect to in-line measurement. At Optum, we have built several processes to accomplish the goals of DQAF measurement types. We take measurements within our ETL processes and store the data for purposes of trend analysis, as well as to detect obvious anomalies.
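The Python sketch below illustrates the general idea of in-line measurement: take a measurement during each load, keep it alongside measurements from prior runs, and flag values that fall far outside the historical trend. The choice of measurement (row counts per load), the three-standard-deviation test, and the numbers are illustrative assumptions, not a description of the Optum processes or of the DQAF’s specific measurement types.

```python
import statistics

def check_against_history(current_value, history, sigmas=3.0):
    """Flag a measurement that falls outside mean +/- sigmas * stdev of past runs.

    current_value: the measurement from the current load (e.g., row count).
    history: list of the same measurement from prior loads.
    Returns (is_anomaly, lower_bound, upper_bound).
    """
    if len(history) < 2:
        return (False, None, None)  # not enough history to judge yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    lower, upper = mean - sigmas * stdev, mean + sigmas * stdev
    return (not (lower <= current_value <= upper), lower, upper)

# Example: row counts from prior daily loads, then a suspiciously small load.
prior_row_counts = [10120, 10098, 10230, 10155, 10180]
print(check_against_history(4000, prior_row_counts))
```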
So… to answer your question directly: I don’t think that specialty tools alone will get people very far. However, if an organization prepares itself to do purposeful analysis of its own data, then the right tools will take them further faster.
Dylan Jones: With your expertise in this area, what are some of the key tips for data quality assessment you would like to share with our readers?
Laura Sebastian-Coleman: I see I have gone on quite a bit on the previous questions, so I’ll limit these to three important points.
Data is always created by a process or a set of processes. Data assessment requires developing knowledge of the processes that create data.
Don’t believe everything you think. Always ask questions about the data you are looking at and about your own assumptions.
Assessment requires both a meaningful approach to measurement and evaluation of measurement results. If you don’t know why you are taking a measurement, you need to figure that part out or figure out what kind of measurement will help you understand the data you are looking at.
About the Author - Laura Sebastian-Coleman
Laura has been a data quality practitioner at Optum (formerly Ingenix) for over nine years. Optum provides information technology solutions to the health care industry. During that time, she has been responsible for establishing a program of data quality measurement for two large health care data warehouses.
She has also been involved in a range of data quality improvement and data governance efforts. The Data Quality Assessment Framework (DQAF) described in Measuring Data Quality for Ongoing Improvement, was developed in conjunction with these efforts.
In past lives, Laura has worked in public relations, corporate communications, and as an English professor.
Link to book website: http://store.elsevier.com/product.jsp?locale=en_US&isbn=9780123970336