Don’t Get Caught in the Statistical Cobwebs of Data Quality

November 14, 2012

A couple of days ago, I heard about a data conversion project where the team was taking a statistical sample of source master data and cleaning it.  The discussion I heard was about how big a sample was needed to achieve a 5% margin of error, and it continued through a series of statistical issues.  This discussion brought me up short because sometimes we get so caught up in the mathematics and techniques of our processes that we lose the basic understanding of when different techniques are appropriate.

I applaud the fact that the data conversion project in question had enough foresight to include a data quality stream, which is certainly not always the case.  We don’t always know the quality level of production data in systems that have been running for many years, and data frequently has to be cleaned further before it is in a state sufficient for running the new system to which it is being migrated.  The standard method to assess the quality level of the data in the source systems is to take a statistical sample of the source data and assess whether the quality level is sufficient for the target system.  Once we’ve determined how much cleaning the sample data needs to get it into proper shape, we can extrapolate that estimate across the entire set of master data to determine how much time and how many resources would be needed to clean the master data.
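As a rough illustration of that estimating method, here is a minimal sketch in Python.  It assumes a simple random sample, a 95% confidence level (z = 1.96), and the most conservative guess of a 50% defect rate when sizing the sample.  The function names, record counts, and minutes-per-fix figure are hypothetical and are not taken from the project described above.

```python
import math


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Records to sample to estimate a defect rate within +/- margin."""
    # Cochran's formula for a very large population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for a data set of known size
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def extrapolate_effort(population, sample_n, sample_defects, minutes_per_fix):
    """Scale the cleaning effort seen in the sample up to the full data set."""
    defect_rate = sample_defects / sample_n
    estimated_defects = defect_rate * population
    estimated_hours = estimated_defects * minutes_per_fix / 60
    return estimated_defects, estimated_hours


# Hypothetical figures: one million master records, 77 defective records
# found in the sample, three minutes to fix each one.
n = sample_size(1_000_000)
defects, hours = extrapolate_effort(1_000_000, n, 77, 3)
print(f"sample {n} records; estimate {defects:,.0f} defects, {hours:,.0f} hours")
```

The point of the sample is the last line: it produces an estimate of the work needed across the whole data set, not a cleaned data set.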

How does a method for statistically determining an estimate turn into the idea that we only need to clean the statistical sample of data? And even if one person accidentally skipped a few steps in specifying the process, why hasn’t anyone else realized that cleaning a statistical sample of data doesn’t make the entire dataset clean?  Either an entire project team has been dazzled by the use of statistics, or no one thought about it very hard because it wasn’t their job.  Either way, if you clean a statistical sample of data, then only that sample will be sufficiently clean for your target system; the rest of the data will remain at the quality level it started with.
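The arithmetic makes the point plainly.  The sketch below uses hypothetical numbers and assumes that defects are spread evenly, so the sample contains its proportional share of them.

```python
def defect_rate_after_cleaning_sample(population, sample_n, defect_rate):
    """Overall defect rate if only the sampled records are cleaned."""
    defects_before = population * defect_rate
    # Expected number of defective records that happen to be in the sample
    defects_fixed = sample_n * defect_rate
    return (defects_before - defects_fixed) / population


# Hypothetical: one million records, 20% defective, a sample of 385 cleaned.
before = 0.20
after = defect_rate_after_cleaning_sample(1_000_000, 385, before)
print(f"defect rate before: {before:.4%}, after: {after:.4%}")
```

With these figures, cleaning the sample fixes roughly 77 of an estimated 200,000 defective records, so the overall defect rate barely moves.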

How do we prevent a problem like this?  I believe that a great deal of the problem is that most people like to stay as far away from theoretical mathematical discussions as possible because, as Barbie used to say: “Math is hard”.  I think it is important, however, that people with common sense ask questions about project planning and process, even if they don’t understand the complexity of a technical design or approach.  Even very complex subjects like encryption and business continuity need to make sense in their implementation, and they can easily be applied in slightly wrong ways that lose the intent. The economist John Kenneth Galbraith proposed in the 1960s that technicians would take over the running of companies because business people wouldn’t understand what the technicians were talking about.  That did not happen because business managers with common sense insisted that the technicians explain the concepts until the managers understood them, regardless of how long it took.  We need to continue that practice, even with the use of statistics and mathematical concepts in project planning and data management.


People who Tweet about Data Management

April 30, 2012

Data Management & Architecture

Karen Lopez @datachick

Neil Raden @NeilRaden

Robin Bloor @robinbloor

M. David Allen @mdavidallen

Sue Geuens @suegeuens

Mehmet Orun @DataMinstrel

Alec Sharp @alecsharp

Loretta Mahon Smith @silverdata

Eva Smith @datadeva

Corine Jasonius @DataGenie

Peter Aiken @paiken

Tony Shaw @tonyshaw

Glenn Thomas @Warduke

Bonnie O’Neil @bonnieoneil

Rob Paller @RobPaller

Pete Rivett @rivettp

Charles T. Betz @CharlesTBetz

Tracie Larsen @RelatedStuff

Wayne Eckerson @weckerson

Julian Keith Loren @jkloren

Christophe @mydatanews

Steve Francia @spf13

Gorm Braavig @gormb

Jim Finwick @jimfinwick

Alexej Freund @alexej_freund

Corinna Martinez @Futureatti

Data Quality

Jim Harris @ocdqblog – blog

David Loshin @davidloshin – blog

Rich Murnane @murnane

Daragh O Brien @daraghobrien

Jacqueline Roberts @JackieMRoberts

Steve Tuck @SteveTuck

Vish Agashe @VishAgashe

Julian Schwarzenbach @jschwa1

Henrik L. Sorensen @hlsdk

MDM and Data Governance

Jill Dyche @jilldyche – blog

Charles Blyth @charlesblyth

Steve Sarsfield @stevesarsfield – blog

Dan Power @dan_power

Philip Tyler @tylep0

Business Intelligence and Analytics

Marcus Borba @marcusborba

Tamara Dull @tamaradull

Claudia Imhoff @Claudia_Imhoff – blog

Scott Wallask @BI_expert

Peter Thomas @PeterJThomas – blog

Barney Finucane @bfinucane

Matt Winkleman @mattwinkleman

Stray_Cat @Stray_Cat

Brett2point0 @Brett2point0

Risk Management

Peter Went @Bank_Risk

Joshua Corman @joshcorman

Michael Rasmussen @GRCPundit

Nenshad Bardoliwalla @nenshad

Gary Byrne @GRCexpert

Helmut Schindlwick @Schindwick

Technology Companies and Data Organizations

Oracle @Oracle

DAMA international @DAMA_I

McKinsey on BT @mck_biztech

SmartData Collective @SmartDataCo

DataFlux InSight @Datafluxinsight

Gartner @Gartner_inc


Scientific Computing @SciCom

Wearecloud @wearecloud

CloudCamp @cloudcamp

Panorama Software @PanoramaSW

Data Hole @datahole

BI Knowledge Base @biknowledgebase

EnterpriseArchitects @enterprisearchitects @dataqualitypro

RSA Archer eGRC @ArcherGRC

Exobox @Exobox_Security

EA_Consultant @EA_Consultant

Cloudbook @cloudbook

ID Experts @idexperts

IAIDQ @iaidq

EMC Forum @EMCForums

Data Junkies @datajunkies

True Finance Data @truefinancedata

Madam @TheMDMNetwork

IBM Initiate @IBMInitiate

Accelus_GRC @PaisleyGRC

DQ Asia Pacific @DQAsiaPacific

Data Guide @DataGuide

PCI PA-DSS Data @DataAssurant

DataFlux Corporation @DataFlux

If the Data Quality got better but no one measured …

November 2, 2011

There is an old philosophical question: “If a tree fell in the forest but no one heard it, did it make a noise?”  The basis of the question is that every time we’ve seen a tree fall in the past it has made a noise, but if no one heard this one fall then maybe, this one time, it didn’t … and you couldn’t prove it either way.

Metrics and measures are centrally important to certain areas of Data Management such as Data Governance, Master Data Management, and especially Data Quality.  You can’t demonstrate that the quality of data improved unless you measure it.  You can’t report the benefit of your program unless you measure it.  And showing improvement means that you need to measure both before and after to calculate the improvement.
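A minimal sketch of a before-and-after measurement might look like the following.  The rule, field name, and records are hypothetical, and a real program would measure many rules against far more data.

```python
def pass_rate(records, rule):
    """Share of records that satisfy a rule (a function returning True/False)."""
    if not records:
        return 0.0
    return sum(1 for record in records if rule(record)) / len(records)


def improvement(before, after, rule):
    """Measure the same rule on both snapshots and report the change."""
    rate_before = pass_rate(before, rule)
    rate_after = pass_rate(after, rule)
    return {"before": rate_before, "after": rate_after,
            "change": rate_after - rate_before}


# Hypothetical rule: every customer record must carry a postal code.
def has_postal_code(record):
    return bool(record.get("postal_code"))


snapshot_before = [{"postal_code": "07922"}, {"postal_code": ""},
                   {"postal_code": None}, {"postal_code": "10001"}]
snapshot_after = [{"postal_code": "07922"}, {"postal_code": "07901"},
                  {"postal_code": None}, {"postal_code": "10001"}]

print(improvement(snapshot_before, snapshot_after, has_postal_code))
```

Without the first snapshot there is nothing to subtract from, which is why the baseline has to be measured before any cleaning starts.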

Senior executives in organizations want to know what value a technology investment brings them.  The ways to show value are increased revenue, lowered cost, and reduced risk (which can include regulatory compliance). Unless financial benefit is reported to management, few organizations are willing to support ongoing improvement projects for multiple years.  It is also important to report both what the financial benefit has been and what additional opportunities remain, because management is very happy to declare success and terminate the program unless you are also reporting what remains to be done.

Show Me the Money! – Monetizing Data Management

October 23, 2011

On November 10 Dr. Peter Aiken will be coming to Northern NJ to speak at the DAMA NJ meeting about “Monetizing Data Management”: understanding the value of Data Management initiatives to an organization and the cost of not making Data Management investments.

At Data Management conferences I’ve been to over the last few years people are still more likely to attend sessions on improving data modeling techniques than on valuing data assets or data management investment, or the cost of poor quality data.  Maybe it’s just the nature of the conferences I attend, more technical than business oriented.  But I think every information technology professional should be prepared to explain, when asked or given the opportunity, why these investments are imperative.

If you can make it to Berkeley Heights on November 10, I hope you will attend Dr. Aiken’s presentation, which you will find to be a great use of your time.



When Technology Leads – The Tail Wagging the Dog

September 27, 2011

There is a great temptation to implement technology because it is “cool”, but for decades business and technology strategists (as well as most people in both business and technology) have realized that unless your business is to sell technology, the implementation of technology should be in support of business goals.  Sometimes, technology innovations can provide entirely new ways of performing business services and allow business differentiation.  In fact, there is a movement toward technology strategy being developed in collaboration with business strategy, rather than subsequently.

There are also some business functions that must be performed by every organization that are critical to business operation where, in practice, the technology organization tends to lead. One such area is “Business Continuity”, preparing for emergencies and business disruptions.  This is a business responsibility which cannot be simply delegated to the technology organization, and yet it requires significant specialized expertise, and in practice tends to be developed mostly by highly trained technologists.  The part of Business Continuity that deals with the recovery of data and computer systems is called “Disaster Recovery” and is a core technology operations capability.  So, the technology organizations tend to provide most of the resources to help business areas develop, test, and implement Business Continuity plans.  In practice, the tail wags the dog.

Best practice holds that Data Governance and Data Quality programs should be led by business managers, not IT, but there are key aspects of these programs which cannot be accomplished readily without technology support.  The key skills needed to perform these functions are process improvement and data analysis, which are found most frequently in technology organizations.  Frequently, Data Governance and Data Quality initiatives get started in IT, but they tend to be much more successful when led from business areas.

Unattainable Data Quality

August 16, 2011

Is perfect Data Quality an attainable goal for an organization? Today I saw a blog post from Henrik Liliendahl Sørensen on “Unmaintainability”. My first reading of this title was “unattainability,” which got me thinking about how Data Quality can be seen as “Unattainable.”

When I was first hired by a particular multi-national financial services organization in the late 1980s, my title was “Global System Deployment Specialist.” The title did not refer to weapons systems; it meant that I specialized in the implementation phases of global application systems development and operation. I was a Closer! Interestingly (well, to me), there are few people who are particularly good at this.  Organizations tend to focus on development but hesitate, especially with very large systems made up of hundreds or thousands of programs, to finally “go live.”  One part of this problem is that those who have little experience with large systems may believe, and argue, that a system should be free of issues before implementation. Ha!  The key to breaking through that “unattainable” goal is to classify and prioritize issues with strict definitions of priorities.  Should a misspelled word on an internal screen stop implementation?  Should an enhancement request stop implementation? The costs of delayed implementation can be astronomical and need to be managed firmly. Admittedly, there are many examples of disastrous system implementations that are even more costly.

Perfectly secure systems were similarly “unattainable.”  Financial services organizations have always been on the bleeding edge of security technology because they tend to be targets for security attacks.  So, I came to view designing and implementing secure systems as the equivalent of trying to achieve nirvana, a goal toward which one constantly strives without ever reaching it.  We make our systems secure based on organizational and regulatory standards and best practices, balancing cost and risk as appropriate.  Hey, I once implemented a data warehouse in Switzerland to which no one was allowed to have access.  Was this perfect security?  No, there was no business value in a system no one could access.

The issue of perfect Data Quality has a similar unattainability.  We need to classify the types of issues that may be found with data and the importance of particular types of data to the organization.  We need to understand the regulatory and organizational rules associated with different types of data.  We need to assess the quality of the data and determine what goals for improvement are realistic and cost-effective.  Much data may not even be important enough to an organization to warrant the cost of assessment.  Then, we need to balance the cost of fixing data against the risk of not fixing it.

Achieving perfect Data Quality may be “unattainable”.  But the real goal is to understand and manage the risks and costs associated with improving organizational Data Quality.

Data Quality Reports in a Data Governance Program

May 4, 2011

There are two types of Data Quality reports that are regularly produced for Data Governance: reports listing the data that is out of compliance with business rules, and statistics on that non-compliant data, including whether the data has gotten better or worse since the previous report.
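The two report types can be sketched in a few lines of Python.  This is an illustration only: the business rules, field names, and the shape of the previous report are hypothetical, and a real program would pull its rules from the governance body’s definitions.

```python
from collections import Counter

# Hypothetical business rules: name -> check that returns True when compliant
RULES = {
    "customer_id present": lambda r: bool(r.get("customer_id")),
    "country is 2 letters": lambda r: isinstance(r.get("country"), str)
    and len(r["country"]) == 2,
}


def exception_report(records, rules):
    """Report 1: each record that breaks a rule, with the rule it breaks."""
    return [(index, name)
            for index, record in enumerate(records)
            for name, check in rules.items()
            if not check(record)]


def statistics_report(records, rules, previous=None):
    """Report 2: violation counts per rule, with the trend since the last run."""
    counts = Counter(name for _, name in exception_report(records, rules))
    report = {}
    for name in rules:
        count = counts[name]
        row = {"violations": count,
               "rate": count / len(records) if records else 0.0}
        if previous and name in previous:
            delta = count - previous[name]["violations"]
            row["trend"] = ("better" if delta < 0
                            else "worse" if delta > 0 else "unchanged")
        report[name] = row
    return report


records = [
    {"customer_id": "C1", "country": "US"},
    {"customer_id": "", "country": "USA"},
    {"customer_id": "C3", "country": None},
]
last_run = {"customer_id present": {"violations": 3},
            "country is 2 letters": {"violations": 1}}

print(exception_report(records, RULES))
print(statistics_report(records, RULES, previous=last_run))
```

The first report goes to the people who fix the records; the second goes to the people who need to see whether the numbers are moving in the right direction.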

I suppose there is also the report of metrics on the Data Governance program itself.