My Experiments With Data Quality, A Guest Post By Meghana Bhise

This guest post is by Meghana Bhise, Data Quality Specialist and Solution Architect based in India.

I’m a pretty new entrant when it comes to pure Data Quality. It is something I have dealt with when constructing Data Warehouses and I remember it being something of a nuisance that one just dealt with. You cursed and swore at the people in charge of the dirty data and then did what was needed to clean it and make it presentable to the powers that be.

That changed 3 years ago when the bosses at my new organization, in their infinite wisdom, decided that I should work on Data Quality projects.

It seems whoever had constructed their Data Warehouse (DWH) had made a fine mess of it and the "Business" was complaining about the quality of reports that it could produce. I was put to work on what was being called a Data Quality Platform. We had no customers... but we had a tool. We were supposed to wave our magic wand and make everything alright. This of course was far easier said than done. We had to do all of this without touching the DWH or the Extract-Transform-Load (ETL) environment.

Our first course of action was trying to find customers – mainly teams connected with the DWH’s data sources - and we used to run demos of what our tool could do. At that point the emphasis was on the tool and not really on Data Quality in the broader sense or what it meant. However, this initial approach did help, as people became aware of a way to fix their data issues. There were plenty of them… duplication, incomplete records, sometimes even garbage records. These were issues not just with transactional data; I was surprised to find that our so-called Master Data and Reference Data also suffered from data issues. It was no wonder, then, that the DWH was suffering from quality issues.

I was smart or stupid enough to once ask:  

"Why aren't we enforcing our business rules when data is being gathered?"

The answer I got was that a lot of data was gathered in web forms and if we put too many constraints there, our customers would not even want to open the forms. So we were stuck with whatever we had because we could not touch the web forms.

In some cases the business rules were not even documented anywhere, and I remember being part of a small project which was trying to document those rules for a data source. This data source was feeding into the DWH, and one of the ways we discovered business rules was by looking at the ETL - talk about reverse engineering! This data source itself was getting data from several other sources and processes, so we had to go to those sources and find out what the different fields meant and what constraints they had. All of this had to be done whilst handling the egos of all parties involved. We soon found that documenting business rules was not as easy as it sounded.

There were teams which had started Data Quality initiatives and defined targets for Data Quality without knowing how to achieve them. I remember our first meeting with one such team… for them we were like manna from heaven! We could actually help them! Our first task with this team was to educate them a little about Data Quality and convince them that Data Quality targets could not be defined arbitrarily. You could not just say that by next quarter end you will have 85% Data Quality! You had to profile your data and find the current levels of incompleteness, duplication, etc. You had to define what actions were to be taken for completing the records and how duplicates should be dealt with and, most importantly, you had to know your data standards and business rules well.
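
As an illustration of that profiling step (and not the tool used on the project), a minimal Python sketch might look like the one below. The `profile` helper, the field names and the sample records are all hypothetical; it only shows how a baseline for completeness and duplication could be measured before any target is set.

```python
from collections import Counter

def profile(records, required_fields, key_fields):
    """Return completeness and duplication rates for a list of dict records."""
    total = len(records)
    if total == 0:
        return {"total": 0, "complete_pct": 0.0, "duplicate_pct": 0.0}

    # A record is complete when every required field has a non-blank value.
    complete = sum(
        1 for r in records
        if all(str(r.get(f) or "").strip() for f in required_fields)
    )

    # Records sharing the same normalized key are treated as duplicates;
    # every occurrence after the first one is counted.
    keys = Counter(
        tuple(str(r.get(f) or "").strip().lower() for f in key_fields)
        for r in records
    )
    duplicates = sum(n - 1 for n in keys.values())

    return {
        "total": total,
        "complete_pct": round(100.0 * complete / total, 1),
        "duplicate_pct": round(100.0 * duplicates / total, 1),
    }

# Hypothetical sample data, for illustration only.
sample = [
    {"name": "Asha Rao", "email": "asha@example.com", "city": "Pune"},
    {"name": "asha rao", "email": "ASHA@example.com", "city": ""},
    {"name": "Ravi Kumar", "email": "", "city": "Mumbai"},
    {"name": "Meera Shah", "email": "meera@example.com", "city": "Delhi"},
]

report = profile(
    sample,
    required_fields=["name", "email", "city"],
    key_fields=["name", "email"],
)
print(report)
```

With a baseline like this in hand, a target such as "85% complete by next quarter" can at least be compared against where the data stands today.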

One important advantage with this team was that their business rules could be enforced while gathering data and we made sure that they understood how they could help themselves by enforcing the rules. For the remaining data which was already in the system we worked with them extensively to resolve quality issues. These days we are trying to help them in monitoring the Data Quality on an ongoing basis. Working with this team was a fantastic learning experience for us and I have to say we were blessed with an understanding customer.
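
Enforcing rules at the point of data gathering can be sketched in the same spirit. The rules below (a non-blank name, a loosely checked email address, a six-digit PIN code) are assumptions made up for this example; real rules would come from the team's documented data standards.

```python
import re

# Hypothetical business rules: each maps a field name to a check on its value.
RULES = {
    "name": lambda v: bool(v.strip()),
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "pincode": lambda v: re.fullmatch(r"\d{6}", v) is not None,
}

def validate(record):
    """Return the names of the fields that violate a rule (empty list if none)."""
    return [
        field for field, rule in RULES.items()
        if not rule(str(record.get(field) or ""))
    ]

ok = validate({"name": "Asha Rao", "email": "asha@example.com", "pincode": "411001"})
bad = validate({"name": " ", "email": "asha@example", "pincode": "4110"})
print(ok, bad)
```

A form or loading job could reject or flag any record for which `validate` returns a non-empty list, so the problem is caught before it reaches the DWH.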

Slowly but surely other customers have started understanding the value of Data Quality and now there is a separate initiative being set up for Data Governance. How well it succeeds now remains to be seen and I sincerely hope that this doesn’t turn out to be one of those initiatives which started off in the right direction but lost steam midway.

In hindsight, I think I can understand how we came to be in the mess we were in. Initially our IT infrastructure org was a fragmented one, with each major department having its own supporting IT org. Any system development was done more or less in a silo, with no real interactions happening with other departments. This in some way gave rise to a lot of "data inventory". There was duplication, there was a confusion of standards and there was a confusion of technology as well. When the DWH was created I imagine there was no real focus on the needs of the business, and coupled with the confusion created by multiple source systems running in silos, the result could only be… well maybe disastrous is too strong a word, but yes something along that line.

Another important factor is the lack of documentation of data standards. "Oh, so-and-so knows everything about this data" doesn’t really work, does it? What happens after so-and-so retires or leaves? I met people who knew the data and the processes inside out, but they were the only ones who knew. I can only imagine the result of such a person leaving: more work for me documenting data standards and business rules, for sure. And then who do I go to for that information? So documentation really is the key!

All in all, it has been quite an educational journey so far! For someone who thought of Data Quality issues as a nuisance, I’ve come a long way. I definitely can’t claim to know everything about Data Quality but now I at least know where it can begin.

After all, well begun is half done!