Don’t Get Caught in the Statistical Cobwebs of Data Quality

November 14, 2012

A couple of days ago, I heard about a data conversion project where the team was taking a statistical sample of the source master data and cleaning it.  The discussion I heard concerned how big a sample needed to be to achieve a 5% margin of error, and it continued through a series of statistical issues.  The discussion brought me up short: sometimes we get so caught up in the mathematics and techniques of our processes that we lose sight of when different techniques are appropriate.
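For context, the sample-size arithmetic the team was debating is standard and, on its own, perfectly sound. A minimal sketch using Cochran's formula (the function name and the illustrative population figure are mine, not the project's):

```python
import math

def required_sample_size(margin_of_error=0.05, z=1.96, p=0.5, population=None):
    """Cochran's formula for the sample size needed to estimate a
    proportion (e.g. a defect rate) at a given margin of error.
    z=1.96 corresponds to 95% confidence; p=0.5 is the worst case."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite-population correction for smaller master data sets.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# Roughly 385 records suffice at 95% confidence and a 5% margin,
# almost regardless of how large the source data set is.
print(required_sample_size())                    # → 385
print(required_sample_size(population=100_000))  # → 383
```

The point of the formula is to size a sample for *assessment*; nothing in it implies that cleaning those 385 records does anything for the rest.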

I applaud the fact that the data conversion project in question had enough foresight to include a data quality stream; that is certainly not always the case.  We don’t always know the quality level of production data in systems that have been running for many years, and data frequently has to be cleaned further before it is fit for the new system to which it is being migrated.  The standard method is to take a statistical sample of the source data and assess whether its quality level is sufficient for the target system.  Once we’ve determined how much cleaning the sample needs to get it into proper shape, we can extrapolate that estimate across the entire set of master data to determine how much time and how many resources cleaning it would require.
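The extrapolation step is simple arithmetic. A sketch, with entirely hypothetical numbers (the function name, the defect count, and the productivity figure are mine for illustration):

```python
def estimate_cleaning_effort(sample_size, sample_defects,
                             population_size, fixes_per_person_day):
    """Extrapolate the defect rate observed in the sample to the
    full master data set, and convert it into person-days of work."""
    defect_rate = sample_defects / sample_size
    estimated_defects = round(defect_rate * population_size)
    person_days = estimated_defects / fixes_per_person_day
    return defect_rate, estimated_defects, person_days

# E.g. 77 bad records found in a 385-record sample of 1,000,000 rows,
# with one person able to fix about 100 records per day:
rate, defects, days = estimate_cleaning_effort(385, 77, 1_000_000, 100)
print(rate, defects, days)  # → 0.2 200000 2000.0
```

The sample tells us the *size* of the cleaning job; the 200,000 estimated bad records still have to be cleaned across the whole data set.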

How does a method for statistically estimating effort turn into the idea that we only need to clean the statistical sample?  And even if one person accidentally skipped a few steps in specifying the process, why hasn’t anyone else realized that cleaning a statistical sample doesn’t make the entire dataset clean?  Somehow an entire project team has been dazzled by the statistics, or no one thought about it very hard because it wasn’t their job.  Either way, if you clean only a statistical sample of the data, then only that sample will be sufficiently clean for your target system; the rest of the data will remain at the quality level it started with.

How do we prevent a problem like this?  I believe a great deal of the problem is that most people like to stay as far away from theoretical mathematical discussions as possible because, as Barbie used to say: “Math is hard”.  I think it is important, however, that people with common sense ask questions about project planning and process, even when they don’t understand the full complexity of a technical design or approach.  Even very complex topics like encryption and business continuity need to make sense in their implementation, and they can easily be applied in slightly wrong ways that lose the original intent.  The economist John Kenneth Galbraith proposed in the 1960s that technicians would take over the running of companies because business people wouldn’t understand what the technicians were talking about.  That did not happen, because business managers with common sense insisted that the technicians explain their concepts until they were understood, regardless of how long it took.  We need to continue that practice with the implementation of statistics and mathematical concepts in project planning and data management.