Misconceptions of Data Quality on Data Migration Projects, Featuring John Morris
Data migration projects have an extremely high failure rate, and one of the principal causes is a lack of effective data quality management across all aspects of the project.
To help members reverse the trend of project failure we have enlisted the support of John Morris, author of Practical Data Migration and creator of the PDMv2 Data Migration Methodology, the only industry-standard methodology for data migration currently available.
John is also the creator of Data Migration Matters, a unique event for the data migration sector, and he provides great advice via his BCS blog.
Data Quality Pro: Welcome to the first online coaching session with John Morris from Iergo. John is also the author of “Practical Data Migration”, published by the BCS in the UK.
The reason for this call is that, in a recent questionnaire of Data Migration Pro members, data quality was repeatedly cited as the biggest issue. We will use this call to explore some of the key misconceptions surrounding data quality and the common mistakes that companies make. We will then introduce some best practices to resolve them.
Let’s start, John, with the first misconception, which is that organisations can “fix” their data quality issues with data quality tools.
John Morris: It’s a common misconception we have with software: most of us come from a technical background and often see solutions in terms of software. I should say this is not to belittle the role of tools on a migration project; however, there are many issues which can’t really be touched by software.
The most obvious one is what I call the “reality check” between what is in the database and what is actually happening in the real world.
The software can perform various checks but it can’t tell you what is accurate compared to the real world.
When I first specialised in data migration I worked on a migration for a water and waste utility project. We loaded our data quality scripts and went to show the business the data they would be getting in the target system.
I demonstrated this data to the area manager.
He examined the data and then showed me two extremely large pumps outside the office in the main plant; however, our data only had one pump.
“Can you count?” he inquired.
So from then on, all the data in our dataset was compromised in terms of business support, because the business reasoned that if we couldn’t manage the large items, how could we manage the rest of the inventory?
It doesn’t matter how good your data quality software is; there are always going to be issues the tools can’t touch.
Aside from the reality-check issue, there are the semantic rules of the data.
Many datasets are more than 20 years old, and when you get into a migration you will identify large numbers of anomalies caused by local differences in meaning embedded in the data.
Data quality tools will identify these anomalies, but they won’t be able to resolve those semantic issues.
So you need to find a way of combining the tools with the knowledge of the business community to find the hidden semantic issues.
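To make that point concrete, here is a minimal profiling sketch in Python with pandas (purely illustrative; the dataset, column names and codes are assumptions, not taken from the interview). It shows the kind of anomaly a tool can surface automatically, and the semantic question it cannot answer.

```python
import pandas as pd

# Hypothetical legacy asset extract; the values are invented for illustration.
assets = pd.DataFrame({
    "asset_id": [101, 102, 103, 104, 105],
    "asset_type": ["PUMP", "PMP", "Pump", "VALVE", "PUMP"],
    "install_date": ["1987-03-01", "01/03/1987", "1987-03-01", "1990-07-12", None],
})

# A profiling tool can easily flag inconsistent codes and missing values...
print(assets["asset_type"].value_counts())
print(assets["install_date"].isna().sum(), "missing install dates")

# ...but only the business can answer the semantic question:
# is "PMP" a typo for "PUMP", or a distinct asset class used by one region?
```

The value counts surface the anomaly in seconds; deciding what “PMP” actually means still requires the people who work with those assets.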
Data Quality Pro: Can an over-reliance on tools create more problems than it solves, given that you will typically find thousands of data issues?
John Morris: Yes, I agree.
There is a tendency to take a techno-centric approach. People go after the problems that they can fix, and this happens in a number of ways.
It’s kind of like the hammer-and-nail scenario: if your only tool is a hammer, you see every problem as a nail to smack!
For example, some people go after only the data their tool has an adaptor for.
They ignore all the spreadsheets, paper records and local data stores, which can contain vital legacy information that is missing from legacy RDBMS-type systems. Plus, I often go to data quality vendor presentations that show a name and address cleanse when my client has non-customer data.
So it’s really a word of caution; I’m not looking to play down the role of tools. They do help us move forward much faster, but I just want people to realise that data quality tools are not a magic bullet.
Just as an aside, what happens when you move technology to the centre of the project is that service providers can promise more than can really be delivered. When data quality issues are found that can’t be fixed with tools, it is perceived by the client that you are going back on your word, reneging on the implicit contract with the client.
So you now go back to them asking for their help and you often find the help isn’t forthcoming.
So for all these reasons it is necessary to position your data quality tools as a key component of the data migration, but not as the one and only, central part of the data migration and data quality environment.
Data Quality Pro: This leads us on to the next misconception, which is that a lot of customers assume that because they are putting in a new system, they want and expect perfect quality data.
John Morris: I always say that there are a number of golden rules in data migration, and one of them is:
“No company needs, wants or is willing to pay for perfect quality data”.
When I make this claim it is quite often met with some challenge on the part of the clients. They have spent a lot of money on a new COTS system, for example, so why can’t they have perfect data?
The problem is that data migration projects are time-bound; there is a hard stop, or a business case that requires the project to finish at a specific point.
There are drivers to finish the project and realise the benefits of the new software.
This is at odds with the ongoing data quality environments that a lot of data quality tools come from.
A lot of data quality improvement projects are cyclical, following a measure, improve, control type of cycle that goes on continuously.
We don’t have that luxury on a data migration project. We have to get to the end of it using data quality that is good enough to support all the business and technical constraints.
We are not going to the point of perfect quality data.
Going back to my earlier point, we need to prioritise with the business community to identify which features can be switched on, coupled with the data needed to support those features.
The takeaway from this is that yes, we strive for good quality data, but due to time and money constraints we have to build prioritisation into our data migration processes, and this must include a significant amount of business input.
Data Quality Pro: That’s a really good point. How do you lay out your approach to prioritisation so that it is managed effectively?
John Morris: In my book (see below) I talk about my data quality rules process.
This involves stakeholders from both the project and the business side.
It is rolled out from the start of the project right through to the end, from early data discovery analysis all the way through to migration implementation.
That data quality rules process relies on a steering group or control board that has a mix of domain experts and system experts coming out with a balanced view as to what should be prioritised.
I find that if you have an adult, peer-to-peer conversation with the business, you have a far more productive relationship than in a situation where the project “tells” the business how things should be done.
So a key activity is to use the data quality rules process within your data migration strategy.
Data Quality Pro: The final misconception you noted is that there is an expectation that there will be a fallout rate in migrations and customers should simply set up a technical team to manage these fallouts.
John Morris: The Bloor report demonstrates that 80% of data migration projects run late and over budget. As we’ve seen from the survey, it’s data quality issues that more often than not cause a project to fail or be delayed. It’s rarely the software that causes the delay. So the expectation is that the project is going to fail.
A big factor in this is that many people come into data migration from a data quality process or ETL background.
Data quality processes often involve a lot of cleansing and post-detection correction in a cyclical process.
We can’t afford a delay in data migrations, and I always aim for a zero-defect data migration.
I believe that any methodology that aims for less than that is not complete.
It is the philosophy of defeat to create a migration strategy that focuses on fallout management.
The only caveat is that sometimes it is permissible; for example, when you are working through your data quality rules strategy and doing your prioritisation, it is sometimes easier to fix things in the target.
When you have a few million records and a handful of them have issues that will require manual entry into the target, it is sometimes easier to let the migration reject those records.
However, we are still managing these issues effectively and are not ignoring our responsibilities for identifying every defect.
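As a rough sketch of that caveat (illustrative only, not PDMv2 itself; the record structure and validation rule are assumptions made for the example), the essential point is that the handful of rejected records are logged for manual entry into the target rather than silently dropped.

```python
# Minimal sketch: load what passes validation, but keep an auditable log of
# every rejected record so that nothing is silently ignored.
def migrate(records, is_valid, load, reject_log):
    loaded = rejected = 0
    for record in records:
        if is_valid(record):
            load(record)               # write to the target system
            loaded += 1
        else:
            reject_log.append(record)  # queued for manual entry / review
            rejected += 1
    return loaded, rejected

# Example usage with a trivial rule: a record must have a non-empty asset id.
rejects = []
loaded, rejected = migrate(
    records=[{"asset_id": "A1"}, {"asset_id": ""}, {"asset_id": "A3"}],
    is_valid=lambda r: bool(r.get("asset_id")),
    load=lambda r: None,               # stand-in for the real target load
    reject_log=rejects,
)
print(f"loaded={loaded}, rejected={rejected}, awaiting manual fix: {rejects}")
```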
Data Quality Pro: Thanks John, we’ve got some questions coming in: “We’re using a supplier to help with our data integration project and the project is nearing completion. They are now taking the stance of not fixing any issues unless we really push for them”.
So the supplier is clearly trying to protect their contract and bonus payments by restricting their workload and hitting the contracted due date.
John Morris: The first issue is – “How do we write our contract?”
I think there is an issue with increasingly tight contracts. You have to remember what everyone’s commercial line is going to be on this. If there are tight penalty clauses then obviously people are going to migrate whatever they can in the data and claim it’s completed. They are also going to care less about things like reality checks [checking the data against actual reality].
I think there is a lot of work to be done to create contracts that contain the right balance of risk and reward.
When it comes to the stance of “we’re not going to fix anything”, this is kind of the opposite of “we want perfect quality data”.
The problem with this is that the old systems have hundreds of workarounds and local processes to get around the data defects.
The new systems don’t have these workarounds, so we expose these defects to the new business services and to public scrutiny, as many of the new systems bring the customer closer to the data.
So customers have to get used to the mentality of addressing data from the outset.
Sorry, I can’t solve your problem there right now, Jim. How about making some contract changes where both parties can benefit?
Data Quality Pro: Another question from Jenny, a member of Data Migration Pro:
“How important do you feel it is to compare the data with reality? Is checking the data against reality the job of the migration team or of the organisation?”
John Morris: I believe it is part of the migration project team’s responsibility. We’re not doing these things in isolation; on day one, when the system opens up for use, we want to put in front of the business data that they can use to perform their jobs.
If the business perceive the data to be of low quality it can impact trust and people start to create their own spreadsheets and local records to work around the defective data.
You therefore run the risk of losing the benefits of the business case.
It comes back to the priority that the business has placed on matching the data to reality.
For example, air traffic systems have to be absolutely accurate.
Other parts of systems don’t have to be that accurate. For example, some systems hold a lot of field data, e.g. pipes underground which can’t be dug up. In these situations it’s impossible to do a reality check, so a process to record accurate data during field servicing, for example, needs to take place.
It is often too expensive to carry out reality checks, so prioritisation with the business community is paramount.
Data Quality Pro: Another question:
“In your book you write about first-cut and second-cut data quality rules, with the first cut assessing legacy data quality and the second cut assessing how well the legacy data fits the target system. Isn’t it a waste of time and money to do data quality analysis of legacy data that will never be migrated?”
John Morris: In a theoretical world we would wait for the target environment to be completed and then undertake the analysis.
However, in the real world we never have enough time to do all the data quality activities we wish to undertake. Often the systems can be delivered within weeks of the migration implementation.
It comes down again to a balance of risk and reward.
In a practical sense I can only think of one migration project, many years ago, where the fundamentals of the target were changing so much that there was wasted data quality effort.
You find that in most cases there are key data items that will need to be migrated, and with high quality, as there is always information that is core to the business.
Then of course there are other technical pieces of information that the legacy system needs that the target doesn’t.
The thing I say about data quality rules is: yes, do the data quality analysis against the legacy, but you don’t necessarily need to do any fixing of it.
If it’s a questionable item and you’re not sure whether you need to correct it before you migrate it, by all means postpone your activity on that.
When I created the book (Practical Data Migration) I laid it out in a waterfall fashion, because that’s easier to present in a book format, but in the next edition I will have to explore iterative approaches to data migration.
With an iterative approach it’s worth thinking about where the data quality issues lie.
We talked about the reality issue: the gap between what’s in the system and what’s in the real world.
There is also the gap between the data structures of the legacy and the target. We all know that there will be differences in how data models and structures are created in these two environments, and we then create the source-to-target mappings.
However, it’s important not to ignore the gap between the legacy data models and the legacy data itself. If we don’t at least record what those gaps are, and we base our data migration mappings solely on the legacy model, we will get a lot of errors falling out of the migration processes due to a lack of understanding of those gaps.
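As a concrete illustration of recording that gap (a minimal sketch; the table, columns and rules are invented for the example, not taken from the interview), a quick check like this reveals where the legacy data breaks the rules the legacy model claims to enforce, before those assumptions are baked into the source-to-target mappings.

```python
import pandas as pd

# Hypothetical legacy extract; in practice this would be pulled from the legacy system.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],               # model says: unique, not null
    "country_code": ["GB", "UK", "FR", None],  # model says: ISO 3166 alpha-2, not null
})

# Checks derived from what the legacy data model *claims* to be true.
gaps = {
    "duplicate_customer_id": int(customers["customer_id"].duplicated().sum()),
    "null_country_code": int(customers["country_code"].isna().sum()),
    "unrecognised_country_code": int(
        (~customers["country_code"].dropna().isin(["GB", "FR", "DE"])).sum()
    ),
}

# Record the gaps so the mappings are built on the data as it actually is,
# not on the model as it is documented.
print(gaps)
```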
I have never found it a problem to prioritise those things that we know just have to be right when we do the migration.
There is a very low risk that you might do work that you don’t need to do.
But looking at the timeline of most projects, which is typically 12 to 18 months, that provides a great deal of time to carry out data quality work when you’re not under pressure. If you wait until the end of the project to start on prioritisation, it becomes a mad dash and a “fix-in-the-target” approach, and the system then staggers from one data issue to another in its early existence.
I would strongly recommend you do the data quality analysis against the source when you have the time to do it.
Data Quality Pro: There is a misconception that data quality work always means you are going to fix things, which is not correct. The term landscape analysis is a good one, as it means we are examining the various system landscapes in our scope to find the relationships, structures and quality metrics of the data; this doesn’t always require a cleanse.
We’ve got a few technical questions which we’ll cover on another call, because we can’t get deep into data cleansing at present.
One question just in:
“John says in his book that a company does not need, want, or is not willing to pay for perfect DQ. After the initial analysis, once you have identified a range of DQ issues, how should you prioritise what to fix? And what are the key DQ issues you should focus on?”
John Morris: Firstly, you have to prioritise the things that the new system simply won’t work with, or at least that significant components of it won’t work with, because big COTS packages in particular have many functions that may or may not be activated.
The second set of issues we need to prioritise are those that are going to break business processes if we get them wrong.
The third set are those that are going to embarrass the company; I think a few of us will probably be thinking about Terminal 5 at Heathrow here.
We are increasingly in a public world now, and the more open nature of corporate data stores, as well as the Freedom of Information Act, means we can no longer hide behind our corporate data.
So how do we manoeuvre and prioritise between these?
In terms of the first issue, you need to ignore the “bells-and-whistles” features and focus on data that supports key functionality.
If we are going to prioritise business reality, we are going to prioritise those things that will get you by for a few weeks until you fix the errors.
In terms of the last issue you really need to find the defects that are going to cause the organisation to lose customers.
The key to success here is to create a balanced prioritisation team so that at the very start of the project you have all the major stakeholders together, technical and business.
The business side must include people who understand the business relationship issues.
Data Quality Pro: We have another question:
“What is the significance of having business analysts on the project?”
John Morris: Is the point of the question perhaps “Should we not just have technical staff on a data migration as we’re basically dealing with data?”
The need for a business analyst stems from the fact that a business analyst forms the bridge between IT and the business community.
However, in this situation I would say that the business analyst needs to be trained, as it is different from a normal BA role.
You have to work in a much more agile manner because you will be creating a lot of temporary business processes.
You will need to carry out data analysis and understand exactly what goes on in a data migration process.
We need a training course for a new breed of data migration analysts who have this combination of skills and the subtle understanding that is needed on a data migration project.
Data Quality Pro: I’ve got a quick question from Malik, London:
“We often get data from legacy systems that is either not complete or not in the correct format. This often contains characters that are unknown. Do you think ETL tools can be effective in these cases?”
John Morris: You have two options here.
Most data profiling and data quality tools can find data anomalies like the ones you mention.
A lot of the ETL vendors in particular have now integrated the data quality tools they have acquired into their platforms. So there are plenty of options out there to trap these issues, and you shouldn’t need to resort to hand-coding any more.
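For readers without a profiling or ETL tool to hand, here is a minimal Python sketch (purely illustrative; the column name, expected pattern and sample values are assumptions) of the kind of check those tools run to trap incomplete values, unexpected characters and format violations.

```python
import re
import pandas as pd

# Hypothetical legacy extract with the kinds of problems described in the question.
legacy = pd.DataFrame({
    "account_ref": ["AC-00123", "AC-00456", "", "AC-\ufffd0789", None],
})

EXPECTED = re.compile(r"^AC-\d{5}$")  # assumed reference format for the example

def classify(value):
    """Label each value as missing, containing unknown characters, malformed, or ok."""
    if pd.isna(value) or value == "":
        return "missing"
    if any(ord(ch) > 127 for ch in value):
        return "unknown characters"
    if not EXPECTED.match(value):
        return "wrong format"
    return "ok"

print(legacy["account_ref"].map(classify).value_counts())
```

In practice a profiling tool does the same thing declaratively, but the principle is identical: define the expected format, then count and categorise everything that deviates from it.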
Data Quality Pro: We’ve come to the end of the session now.
In the next session we’ll talk a lot more about your data quality rules process, John, and also how we structure the teams. Thanks for coming on the show.
John Morris: My pleasure Dylan.