Unattainable Data Quality

August 16, 2011

Is perfect Data Quality an attainable goal for an organization? Today I saw a blog post from Henrik Liliendahl Sørensen on “Unmaintainability”. My first reading of this title was “unattainability” which got me thinking about how Data Quality can be seen as “Unattainable.”

When I was first hired by a particular multi-national financial services organization in the late 1980s, my title was “Global System Deployment Specialist,” which referred not to weapons systems but to the fact that I was a specialist in the implementation phases of global application systems development and operation. I was a Closer! Interestingly (well, to me), there are few people who are particularly good at this. Organizations tend to focus on development but hesitate, especially with very large systems made up of hundreds or thousands of programs, to finally “go live.” One part of this problem is that those who have little experience with large systems may believe, and promote the idea, that a system should be without issues prior to implementation. Ha! The key to breaking through that “unattainable” goal is to classify and prioritize issues with strict definitions of priorities. Should a misspelled word on an internal screen stop implementation? Should an enhancement request stop implementation? The costs of delayed implementation can be astronomical and need to be managed firmly. Admittedly, there are many examples of disastrous system implementations that are even more costly.
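As a rough illustration of what “strict definitions of priorities” can look like, here is a minimal sketch of a go-live check. The priority levels, their meanings, and the rule that only the top two block implementation are my own assumptions for the example, not a standard.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical priority definitions; each organization must write its own.
    CRITICAL = 1      # e.g. wrong financial results or data loss
    HIGH = 2          # e.g. a core function fails and has no workaround
    LOW = 3           # e.g. a misspelled word on an internal screen
    ENHANCEMENT = 4   # a request for new behavior, not a defect


# Assumption for this sketch: only these priorities stop implementation.
BLOCKING = {Priority.CRITICAL, Priority.HIGH}


def can_go_live(open_issues):
    """Return True when no open issue has a blocking priority.

    open_issues is a list of (description, Priority) pairs.
    """
    return not any(priority in BLOCKING for _, priority in open_issues)


open_issues = [
    ("Misspelled word on internal screen", Priority.LOW),
    ("Request for an extra report column", Priority.ENHANCEMENT),
]
print(can_go_live(open_issues))
```

With only a cosmetic issue and an enhancement request open, the check passes; add a single critical issue and it fails.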

Similarly “unattainable” were perfectly secure systems. Financial Services has always been on the bleeding edge of security technology, because it tends to be a target for security attacks. So, I developed a view of designing and implementing secure systems as being the equivalent of trying to achieve nirvana: a goal for which one constantly strives without ever achieving it. We make our systems secure based on organization and regulatory standards and best practices, balancing cost and risk, as appropriate. Hey, I once implemented a data warehouse in Switzerland to which no one was allowed to have access. Was this perfect security? No, there was no business value in a system no one could access.

The issue of perfect Data Quality has a similar unattainability. We need to classify the types of issues that may be found with data and the importance of particular types of data to the organization. We need to understand the regulatory and organizational rules associated with different types of data. We need to assess the quality of the data and determine what are realistic and cost-effective goals for improvement. Much data may not even be important enough to an organization to warrant the cost of assessment. Then, we need to balance the cost of fixing data with the risk of not fixing it.
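The balancing of cost against risk can be sketched in a few lines. This is only an illustration under stated assumptions: the issue names and dollar figures are invented, and comparing an estimated yearly risk cost to a one-time fix cost is one simple way to frame the trade-off, not the only one.

```python
from dataclasses import dataclass


@dataclass
class DataIssue:
    name: str
    annual_risk_cost: float  # assumed estimate of the yearly cost of leaving it unfixed
    fix_cost: float          # assumed estimate of the one-time cost to fix it


def worth_fixing(issues):
    """Keep issues whose estimated risk exceeds the fix cost, largest net benefit first."""
    candidates = [i for i in issues if i.annual_risk_cost > i.fix_cost]
    return sorted(candidates,
                  key=lambda i: i.annual_risk_cost - i.fix_cost,
                  reverse=True)


# Invented example figures.
issues = [
    DataIssue("missing customer tax id", 50000.0, 8000.0),
    DataIssue("inconsistent phone format", 1000.0, 5000.0),
    DataIssue("duplicate customer records", 30000.0, 12000.0),
]

for issue in worth_fixing(issues):
    print(issue.name)
```

Here the phone format issue drops off the list because fixing it would cost more than the risk it carries.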

Achieving perfect Data Quality may be “unattainable”.  But the real goal is to understand and manage the risks and costs associated with improving organizational Data Quality.

How many Data Integration solutions do you need?

August 8, 2011

For a long time I’ve mulled over whether it’s necessary to keep both a batch and real time infrastructure for Data Integration.  After all, if you have a fully functional real time Data Integration solution, then why do you have to operate two sets of software and monitoring? 

This is not to say that it is absolutely necessary to be running only one Data Integration solution in a large organization – as long as the solutions communicate effectively. There are clear benefits to minimizing the number of technologies running to support a particular capability.  For each technology you run you need to have programmers who are experts in the technology and you have to have people monitoring the operations.  This is in addition to the software licenses and hardware necessary for each solution environment.

So, if you have a real time Data Integration solution in operation then why not just use it for your batch needs as well? 

The problem becomes one of volumes and processing time.  The volumes involved in batch oriented data integration needed for a Data Warehouse nightly load, for example, would probably be difficult for a real time solution to handle in a timely fashion. Of course, it depends on the particular organization involved.  And, given enough money and expertise, you can probably get a real time data integration solution to process whatever volumes are necessary in the time slot available. 

Batch Data Integration solutions (i.e. ETL tools) are focused on handling large volumes in a small time slot, so for most organizations, it is worthwhile to have both batch and real time Data Integration solutions in operation.   If you are calculating alternative costs, remember to include the cost of monitoring operations of multiple solutions as well as the cost of having expertise on staff for multiple technologies – and don’t forget the costs of conversion if you’re thinking of eliminating existing technologies.
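A back-of-the-envelope version of that cost comparison might look like the sketch below. All figures are invented (in thousands per year), and the cost categories are simply the ones mentioned above: licenses, hardware, staff expertise, and operations monitoring, plus a one-time conversion cost.

```python
def annual_cost(solutions):
    """Total yearly cost of running a set of Data Integration solutions."""
    return sum(s["licenses"] + s["hardware"] + s["staff"] + s["monitoring"]
               for s in solutions)


def consolidation_saving(current, proposed, conversion_cost, years):
    """Net saving over `years` from replacing `current` with `proposed`.

    A negative result means consolidation costs more than it saves.
    """
    yearly_saving = annual_cost(current) - annual_cost(proposed)
    return yearly_saving * years - conversion_cost


# Invented figures, in thousands per year.
batch = {"licenses": 100, "hardware": 50, "staff": 200, "monitoring": 50}
real_time = {"licenses": 150, "hardware": 80, "staff": 250, "monitoring": 70}
# Hypothetical real time solution scaled up to absorb the batch volumes.
real_time_only = {"licenses": 250, "hardware": 200, "staff": 300, "monitoring": 90}

net = consolidation_saving([batch, real_time], [real_time_only],
                           conversion_cost=400, years=3)
print(net)
```

With these made-up numbers the yearly saving of 110 does not cover the conversion cost of 400 within three years, which is the kind of result that argues for keeping both solutions.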

Data Integration and Data Governance for Cloud Computing

July 18, 2011

I was recently reviewing the Data Integration architecture for a client and they asked me what they should be looking at for Data Integration when they start using Cloud Computing. The simple, and boring, answer is that you should be able to use the same solutions in the Cloud as you do in a traditional server/database environment. Your Enterprise Service Bus is specifically meant to be able to integrate across heterogeneous technologies. To integrate data from a Cloud Computing environment with other data should require adapters for the specific technologies of the servers where the data is located, whether for Hadoop, other specialized file systems, or specialized database management systems.

The answer for Data Governance of Cloud Computing is even more boring – there aren’t any changes. Data Governance is about management and processes and is technology independent. You may need some specialized tools for profiling data and reporting data quality metrics, but the Data Governance process itself doesn’t change between technologies.

Architecting MDM for Reporting versus Real-time Processing

June 16, 2011

In recent discussions with Joseph Dossantos, he pointed out to me that the differences in architecting an MDM solution for Reporting, such as for a Data Warehouse, and for real-time transaction processing, go beyond the choice of batch versus real-time Data Integration.  Obviously, although the use of a batch ETL solution may be appropriate for integrating the source and target systems with a Master Data hub, it is insufficient for update and access to Master Data being used in transaction processing.  For real-time Data Integration it is better to use an Enterprise Service Bus (ESB) and / or Service Oriented Architecture (SOA).

However, there are other differences in the architectural solution for real-time MDM.  The common functions of MDM, such as matching and deduplication, also need to be architected for real-time use.  The response to information requests needs to be instantaneous. Master Data for Reporting flows from source to hub to target to report (see Inmon’s Corporate Information Factory) but for transaction processing, all capabilities must be able to happen in any order or simultaneously.

Social Media and Data Management

June 7, 2011

Two years ago my friends convinced me to set up a Twitter account because, as they said, if I consider myself a professional “data person” (can I use that as a title on a business card?), then I should know more about this important data outlet.  My use of Twitter arises from two motivations: to create and manage a public persona for myself and also to understand how to advise on social media data management.

Social media is not just a presentation but a conversation, where you can present an image to the public and also mine the conversation about yourself, your competitors, and your market. This is, indeed, an important input into Business Intelligence.

Data Integration is the key to Everything

June 2, 2011

I’ve been thinking a lot recently about Data Integration.  Since most applications at organizations are now purchased packages, it seems that most of the custom development an organization needs to do is around consolidating data into a Data Warehouse or integrating applications together.  Integration isn’t just one of a CIO’s biggest problems, it should be one of their biggest focuses (maybe following only security, business continuity, and core application support).

Today, I was thinking about Data Integration in Cloud Computing, and I realized that Data Integration is the key enabler to Private Cloud / Public Cloud hybrid computing.  In fact, Data Integration is key to any interaction between data stored offsite (“In the Cloud!”) and onsite.

Driving Unstructured Data Management

May 25, 2011

In preparing my presentation for the Data Governance Conference on Unstructured Data Management, I am struck by how much of the focus seems to be on tagging unstructured data (email, documents) with expiration dates to help manage the huge and geometrically growing volumes of unstructured data. Attention seems to be much more on managing the end of the life cycle for unstructured data than it is for structured data.

In my experience, too, the goal of the legal department seems to be to eliminate as much data in the organization as possible. This appears counterintuitive, since most people’s interaction with the legal department is when they tell us not to delete anything. But legal departments would prefer for the organization to have policies that remove old documents and email altogether, so that when asked by a court to produce documents they can say that company policy is to get rid of documents of that age (with specific exceptions including what an organization is legally required to retain) and not have to embark on an expensive search project. If the documents requested are within the company policy for retention, then Legal wants them organized for efficient search and retrieval.
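The retention rule described above, with its exceptions, can be sketched as a simple check. The seven-year period and the two exception flags are assumptions made up for this example; actual retention periods and exceptions come from company policy and applicable law.

```python
from datetime import date, timedelta

# Hypothetical retention period; the real value comes from company policy.
RETENTION_PERIOD = timedelta(days=7 * 365)


def may_delete(created, today, legal_hold=False, must_retain=False):
    """Return True when a document is past retention and no exception applies.

    legal_hold:  Legal has said not to delete (e.g. pending litigation).
    must_retain: the organization is legally required to keep the document.
    """
    if legal_hold or must_retain:
        return False
    return today - created > RETENTION_PERIOD


today = date(2011, 5, 25)
print(may_delete(date(2000, 1, 1), today))                    # old, no exception
print(may_delete(date(2000, 1, 1), today, legal_hold=True))   # old, but on hold
print(may_delete(date(2010, 1, 1), today))                    # still within retention
```

Documents that fail the check are the ones Legal wants kept organized for search and retrieval.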

My mind dismisses what it doesn’t want to see

May 8, 2011

Have you heard about the phenomenon where your mind fills in the pieces of a scene it expects, even if your eyes don’t see them?

I’ve been reading various books on Data Integration lately because I am thinking of proposing my own book on the subject and I was interested in what was already written. So I was reading a chapter in “Data Integration Blueprint and Modeling” by Anthony David Giordano on various Data Integration architectures and he described a “Federated” architecture as being one where tables in various databases, even on separate servers, are joined together. Now this is a perfectly acceptable architectural concept but is, in my experience, so incredibly slow that it is not really a viable option. (I should mention that Mr. Giordano does say that it is not suggested for real-time processing.) I had basically removed the option from my mind until I read his description. There may very well be times when you don’t want all the duplicate data involved in replication, but since it was not something I ever intended to do, it was gone … gone from my brain.

Data Quality Reports in a Data Governance Program

May 4, 2011

There are two types of Data Quality reports that are regularly produced for Data Governance: reports listing the data that is out of compliance with business rules, and reports of statistics on that out-of-compliance data, including whether the data has gotten better or worse since the previous report.

I suppose there is also the report of metrics on the Data Governance program itself.
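The two regular Data Quality reports can be sketched as follows. The rows, the two business rules, and the previous period's figures are invented for illustration; real rules would come from the Data Governance program.

```python
def violations(rows, rules):
    """Report 1: each (row id, rule name) pair where a row breaks a business rule."""
    return [(row["id"], name)
            for row in rows
            for name, check in rules.items()
            if not check(row)]


def violation_rates(rows, rules):
    """Report 2: the share of rows that break each business rule."""
    counts = {name: 0 for name in rules}
    for _, name in violations(rows, rules):
        counts[name] += 1
    total = len(rows)
    return {name: (counts[name] / total if total else 0.0) for name in rules}


def trend(current, previous):
    """Compare rates with the previous report: better, worse, or unchanged."""
    result = {}
    for name, rate in current.items():
        if rate < previous[name]:
            result[name] = "better"
        elif rate > previous[name]:
            result[name] = "worse"
        else:
            result[name] = "unchanged"
    return result


# Hypothetical business rules and sample data.
rules = {
    "email present": lambda r: bool(r.get("email")),
    "age in range": lambda r: 0 <= r["age"] <= 120,
}
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 3, "email": "c@example.com", "age": 150},
    {"id": 4, "email": "d@example.com", "age": 41},
]
previous_rates = {"email present": 0.5, "age in range": 0.25}

print(violations(rows, rules))
rates = violation_rates(rows, rules)
print(rates)
print(trend(rates, previous_rates))
```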

The start of a new week in Data Management

May 2, 2011

I tweet about Data Management @datagrrl, but it’s hard to explain an idea in 140 characters.  So I’ll use this space to write more fully. I also will preview ideas I’m working on for articles, presentations, etc.