Drivers for Managing Data Integration – from Data Conversion to Big Data

April 25, 2013

Data management in an organization is focused on getting data to its data consumers (whether human or application). Whereas the goal of data quality and data governance is trusted data, the goal of data integration is available data – getting data to the data consumers in the format that is right for them.
My new book on data integration has been published and is now available: “Managing Data in Motion: Data Integration Best Practice Techniques and Technologies”. Of course, the first part of a book on data management techniques has to answer the question of why an organization should invest time, effort, and money. The drivers for data integration solutions are very compelling.
Supporting Data Conversion
One very common need for data integration techniques arises when copying or moving data from one application or data store to another, whether replacing an application in a portfolio or seeding the data needed for an additional application implementation. The data must be reformatted as appropriate for the new application’s data store, in both its technical format and its semantic business meaning.
Managing the Complexity of Data Interfaces by Creating Data Hubs – MDM, Data Warehouses & Marts, Hub & Spoke
This, I think, is the most compelling reason for an organization to have an enterprise data integration strategy and architecture: hubs of data significantly simplify the problem of managing the data flowing between the applications in an organization. The number of potential interfaces between applications grows with the square of the number of applications. Thus, an organization with one thousand applications could have as many as half a million interfaces if every application had to talk to every other. By using hubs of data, an organization brings the potential number of interfaces down to a linear function of the number of applications.
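To make the arithmetic concrete: with n applications, the worst case is a connection between every pair, while a hub architecture needs only one connection per application.

```latex
\[
I_{\text{point-to-point}} = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
I_{\text{hub}} = n
\]
\[
n = 1000: \quad \frac{1000 \times 999}{2} = 499{,}500
\ \text{potential interfaces, versus } 1{,}000 \text{ through a hub.}
\]
```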
Master Data Management hubs are created to provide a central place for all applications in an organization to get their master data. Similarly, Data Warehouses and Data Marts give an organization one place to obtain all the data needed for reporting and analysis.
Data hubs that are not visible to the human data consumers of the organization can also be used to significantly simplify the natural complexity of data interfaces. If data being passed around the organization is formatted, on leaving the application where it was updated, into a common format for that type of data, then updating applications need to reformat their data into only one format, instead of a different format for every application that needs it. Applications that receive the updated data need only reformat it from the one common format into their own. This data integration architecture is called the “hub and spoke” approach. The structure of the common data format that all applications pass their data to and from is called the “canonical model.” Applications that want a certain kind of data “subscribe” to it, and applications that provide a certain kind of data are said to “publish” it.
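As a minimal sketch of this publish/subscribe flow (the topic name, record fields, and handlers are hypothetical illustrations, not any particular product’s API; real implementations typically sit on messaging middleware):

```python
# Minimal hub-and-spoke sketch: publishers map their data into the canonical
# model once; the hub routes it; subscribers map from the canonical model once.
from collections import defaultdict

class DataHub:
    """Routes records in the canonical format from publishers to subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of handler functions

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, canonical_record):
        # The publishing application has already reformatted its data into
        # the canonical model; each subscriber reformats it for its own needs.
        for handler in self.subscribers[topic]:
            handler(canonical_record)

hub = DataHub()

# A subscribing application reformats from the canonical model into its own structure.
hub.subscribe("customer", lambda rec: print({"CUST_ID": rec["id"], "CUST_NM": rec["name"]}))

# The updating application publishes in the canonical format.
hub.publish("customer", {"id": 42, "name": "Acme Corp"})
```

With a hub like this, adding an application means writing two mappings (into and out of the canonical model) rather than one mapping per counterpart application.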
Integrating Vendor Packages with an Organization’s Application Portfolio
Current best practice is to buy vendor packages rather than develop custom applications whenever possible. This exacerbates the data integration problem, because each vendor package has its own master data that must be integrated with the organization’s master data, and each must send or receive transactional data for consolidated reporting and analytics.
Sharing Data Among Applications and Organizations
Some data just naturally needs to flow between applications to support the operational processes of the organization. These days, that flow usually needs to happen in real time or near real time, and it makes sense to address the requirements across the enterprise, or across the applications that form the data supply chain, rather than developing independent solutions for each application.
Archiving Data
The life cycle for data may not match the life cycle for the application in which it resides. Some data may get in the way if retained in the active operational application and some data may need to be retained after an application is retired, even if the data is not being migrated to another application. All enterprises should have an enterprise archiving solution available where data can be housed and from which it can still be retrieved, even if the application from which it was taken no longer exists.
Moving data out of an application data store and restructuring it for an enterprise archiving solution is an important data integration function.
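As a minimal sketch of that function (the table, columns, and sample data are hypothetical), data can be extracted from the application’s store and restructured into a self-describing format that remains readable after the application is retired:

```python
# Minimal sketch: extract rows from an application data store and restructure
# them into a self-describing archive file. A real enterprise archiving
# solution adds retention rules, indexing, and retrieval services.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the retiring application's database
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Acme', 99.50)")
conn.row_factory = sqlite3.Row

rows = [dict(r) for r in conn.execute("SELECT * FROM orders")]

# Package the data with enough metadata to interpret it without the source app.
archive = {
    "source_application": "order_entry",  # hypothetical name
    "source_table": "orders",
    "extracted_at": "2013-04-25",
    "records": rows,
}

with open("orders_archive.json", "w") as f:
    json.dump(archive, f, indent=2)
```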
Leveraging External Available Data
A vast amount of data is now available from government and other sources external to a company’s own, some free and some for a fee. To leverage its value, the external data needs to be made available to the data consumers who can use it, in an appropriate format. The volume and velocity of this data may make persisting it unwarranted; instead, techniques such as data virtualization and streaming can be used, or the data can be kept outside the organization altogether by leveraging cloud solutions that are also external.
Integrating Structured and Unstructured Data
New tools and techniques allow analysis of unstructured data such as documents, web sites, social media feeds, audio, and video. The analysis is most meaningful when structured data (found in databases) can be integrated with the unstructured data types listed above. Data integration techniques and new technologies such as data virtualization servers enable this integration.
Supporting Operational Intelligence and Management Decision Support
Using data integration to leverage big data means not just mashing different types of data together for analysis, but also using data streams with that big data analysis to trigger alerts and even automated actions. Example use cases exist in every industry, but the ones we’re all aware of include monitoring for credit card fraud and recommending products.
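A minimal sketch of that pattern follows; the scoring rule and threshold are hypothetical stand-ins for a model derived from historical big data analysis:

```python
# Minimal sketch of stream-triggered alerting: score each event as it
# arrives and raise an alert when a threshold is crossed.
def score(txn):
    risk = txn["amount"] / 1000.0          # larger amounts score higher
    if txn["country"] != txn["home_country"]:
        risk += 0.5                        # out-of-country use scores higher
    return risk

def monitor(stream, threshold=0.8):
    for txn in stream:
        if score(txn) > threshold:
            yield {"alert": "possible fraud", "txn": txn}

transactions = [
    {"amount": 40.0, "country": "US", "home_country": "US"},   # score 0.04
    {"amount": 900.0, "country": "RO", "home_country": "US"},  # score 1.4
]
for alert in monitor(transactions):
    print(alert)
```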


People who Tweet about Data Management

April 30, 2012

Data Management & Architecture

Karen Lopez @datachick

Neil Raden @NeilRaden

Robin Bloor @robinbloor

M. David Allen @mdavidallen

Sue Geuens @suegeuens

Mehmet Orun @DataMinstrel

Alec Sharp @alecsharp

Loretta Mahon Smith @silverdata

Eva Smith @datadeva

Corine Jasonius @DataGenie

Peter Aiken @paiken

Tony Shaw @tonyshaw

Glenn Thomas @Warduke

Bonnie O’Neil @bonnieoneil

Rob Paller @RobPaller

Pete Rivett @rivettp

Charles T. Betz @CharlesTBetz

Tracie Larsen @RelatedStuff

Wayne Eckerson @weckerson

Julian Keith Loren @jkloren

Christophe @mydatanews

Steve Francia @spf13

Gorm Braavig @gormb

Jim Finwick @jimfinwick

Alexej Freund @alexej_freund

Corinna Martinez @Futureatti

Data Quality

Jim Harris @ocdqblog – blog

David Loshin @davidloshin – blog

Rich Murnane @murnane

Daragh O Brien @daraghobrien

Jacqueline Roberts @JackieMRoberts

Steve Tuck @SteveTuck

Vish Agashe @VishAgashe

Julian Schwarzenbach @jschwa1

Henrik L. Sorensen @hlsdk

MDM and Data Governance

Jill Dyche @jilldyche – blog

Charles Blyth @charlesblyth

Steve Sarsfield @stevesarsfield – blog

Dan Power @dan_power

Philip Tyler @tylep0

Business Intelligence and Analytics

Marcus Borba @marcusborba

Tamara Dull @tamaradull

Claudia Imhoff @Claudia_Imhoff – blog

Scott Wallask @BI_expert

Peter Thomas @PeterJThomas – blog

Barney Finucane @bfinucane

Matt Winkleman @mattwinkleman

Stray_Cat @Stray_Cat

Brett2point0 @Brett2point0

Risk Management

Peter Went @Bank_Risk

Joshua Corman @joshcorman

Michael Rasmussen @GRCPundit

Nenshad Bardoliwalla @nenshad

Gary Byrne @GRCexpert

Helmut Schindlwick @Schindwick

Technology Companies and Data Organizations

Oracle @Oracle

DAMA international @DAMA_I

McKinsey on BT @mck_biztech

SmartData Collective @SmartDataCo

DataFlux InSight @Datafluxinsight

Gartner @Gartner_inc

TDWI @TDWI

Scientific Computing @SciCom

Wearecloud @wearecloud

CloudCamp @cloudcamp

Panorama Software @PanoramaSW

Data Hole @datahole

BI Knowledge Base @biknowledgebase

EnterpriseArchitects @enterprisearchitects

DataQualityPro.com @dataqualitypro

RSA Archer eGRC @ArcherGRC

Exobox @Exobox_Security

EA_Consultant @EA_Consultant

Cloudbook @cloudbook

ID Experts @idexperts

IAIDQ @iaidq

EMC Forum @EMCForums

Data Junkies @datajunkies

True Finance Data @truefinancedata

Madam @TheMDMNetwork

IBM Initiate @IBMInitiate

Accelus_GRC @PaisleyGRC

DQ Asia Pacific @DQAsiaPacific

Data Guide @DataGuide

PCI PA-DSS Data @DataAssurant

DataFlux Corporation @DataFlux


Data Virtualization – Part 1 – Business Intelligence

March 26, 2012

The big transformation we’re all dealing with in technology today is virtualization.  There are many aspects to virtualization: infrastructure, systems, organization, office, applications.  When you search the internet for “data virtualization,” most of the references concern business intelligence and data warehousing.  In part 2 of this blog I will talk about data virtualization and transaction processing.

Back in the day, when I used to build data warehouses (the 1990s and onward), there was a concept of “federated data warehouses”, where data in the logical warehouse would be physically separated, either with the same schemas in multiple instances or with different types of data in different locations.  The thought was that the data would be physically separate but brought together in real time for reporting.  We also used to call these “data warehouses that don’t work”.  After all, the reason we created data warehouses in the first place was that we needed to instantiate the data consolidation in order to make response time reasonable when reporting on millions of records. No, really: the response time on these “federated data warehouse” systems used to be many minutes or more.

Now, however, the technologies involved have made huge leaps in capability.  Vendors have put thousands and thousands of hours into making real-time integration and reporting work.  Many techniques enable these capabilities: specialized software and hardware (data appliances), query optimization, distributed processing, and hybrid solutions between pure virtualization and instantiation.  Specialized tuning is necessary, and the fastest solutions still involve instantiating the consolidated data in a central place.

Ultimately, having to run a project to physically incorporate new data into the data warehouse isn’t responsive enough to the business need for information.  It is better to have a short-term solution that allows quick incorporation of new data and then, if there is a continued need for the data in question and you want to speed up response time, to integrate the additional data into the physical data warehouse.

The problems being solved now for business intelligence with data virtualization include real-time integration of multiple regional instances of a data warehouse, integrating data of different types and kinds, and integrating data from a data warehouse with big data and cloud data.  This enables much more responsive business intelligence and analytical solutions to business requests, without always having to instantiate all data for analysis into a central, single, enterprise data warehouse.
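As a minimal sketch of the idea (both fetch functions are hypothetical stand-ins for remote sources), a “virtual view” joins rows from physically separate stores at query time instead of first loading both into the warehouse:

```python
# Minimal sketch of a virtual view: join data from the warehouse and a cloud
# source on demand; nothing is persisted centrally.
def fetch_warehouse_sales():
    return [{"region": "EMEA", "sales": 120}, {"region": "AMER", "sales": 200}]

def fetch_cloud_targets():
    return [{"region": "EMEA", "target": 100}, {"region": "AMER", "target": 220}]

def sales_vs_target():
    targets = {row["region"]: row["target"] for row in fetch_cloud_targets()}
    return [dict(row, target=targets.get(row["region"]))
            for row in fetch_warehouse_sales()]

print(sales_vs_target())  # joined at query time, no central instantiation
```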


How Useful are NoSQL Products?

September 7, 2011

In August I attended the NoSQL Conference in San Jose, California. The conference was about products and solutions that, primarily, don’t use relational databases.  The recent rise of interest comes from the Big Data space and includes areas that aren’t necessarily too big for relational databases, but that just don’t lend themselves to relational solutions. Relational databases became the ubiquitous storage solution for data in application systems during the 1990s.  However, I was a working programmer for more than ten years before that, so I’ve worked with hierarchical databases and indexed file solutions, among other things.  In the late 1990s I had some very good experiences working with multidimensional databases for data marts, which are also not relational.  One of the keynotes at the NoSQL conference was from Robin Bloor, on all the terrible things about relational databases and how they could be done better.  Every sentence out of his mouth was controversial and thought provoking.

The main question in my mind during the conference was: how are these NoSQL technologies and products useful to my customers?  For a large organization with a well-established data management portfolio based on relational databases, what business problems would be better solved with something else?

The Big Data technology movement started around the Hadoop file system and MapReduce applications, built to solve problems of searching web data.  This technology is used by many web companies (Google, Yahoo, Amazon) and social media companies to manage and search vast amounts of data across multiple servers.  It offers solutions for cheaply storing and searching unstructured data distributed across many servers.  How might Hadoop and MapReduce be of interest to traditional data management organizations?  They introduce search of unstructured data, distributed processing, and possibly cloud technology (if the distributed servers are in the cloud).  This opens up the idea of searching through vast amounts of organizational data that might previously have seemed too trivial to spend money storing or too expensive to search.  There are quite a few business intelligence solutions that don’t use relational database technology or that combine it with other databases and technologies.  The most interesting aspect from a technology perspective is the move to distributed processing engines.
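A minimal, single-process sketch of the MapReduce pattern (real Hadoop distributes the input splits across many servers; this just shows the shape of the computation):

```python
# MapReduce in miniature: map emits key/value pairs from each input split,
# a shuffle groups them by key, and reduce aggregates per key.
from collections import defaultdict

documents = ["big data on many servers", "search big data cheaply"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts["data"])  # 2
```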

A problem area that doesn’t seem to lend itself to relational databases is, ironically, understanding how two people or things are related to one another.  Examples include analyzing the degrees of separation between two participants in a social network, or understanding the relationship between a suspected terrorist and someone who calls them on the telephone. These types of problems are better solved using a graph (or node) database with a recursive analytical engine. By the way, when was the last time you wrote a recursive program?  Better get out your Knuth algorithms book.
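Here is a minimal sketch of the degrees-of-separation problem over a hypothetical network, using an iterative breadth-first search (the shortest-path cousin of the recursive traversal mentioned above); graph databases run this kind of traversal natively over far larger datasets:

```python
# Shortest path length between two people in a (hypothetical) social graph.
from collections import deque

network = {
    "ann": {"bob", "carl"},
    "bob": {"ann", "dana"},
    "carl": {"ann"},
    "dana": {"bob", "erin"},
    "erin": {"dana"},
}

def degrees_of_separation(graph, start, goal):
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        person, depth = queue.popleft()
        if person == goal:
            return depth
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return -1  # not connected

print(degrees_of_separation(network, "ann", "erin"))  # 3
```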

Multidimensional databases pre-calculate all (or most) summaries along multiple hierarchies or taxonomies, such as customer, product, organizational structure, or accounting bucket.  They are blindingly FAST at responding to queries, but they are not dynamic: all the calculations must be performed after the data is loaded.  The natural applications for these solutions are data mart cubes.  My experience has been particularly good supporting the Finance organization’s analytical needs.  The capabilities of these databases can be mimicked in a relational database using Kimball’s dimensional modeling and summary tables.
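As a minimal sketch of that post-load calculation step (the dimensions and facts are hypothetical), every combination of dimension levels is summarized up front so later queries are pure lookups:

```python
# Pre-aggregate a tiny cube: roll each fact up to every combination of
# dimension levels, using "ALL" for the summarized level.
from collections import defaultdict
from itertools import product

facts = [
    {"customer": "acme", "product": "widget", "amount": 10},
    {"customer": "acme", "product": "gadget", "amount": 5},
    {"customer": "zenith", "product": "widget", "amount": 7},
]

cube = defaultdict(float)
for fact in facts:
    for cust, prod in product([fact["customer"], "ALL"], [fact["product"], "ALL"]):
        cube[(cust, prod)] += fact["amount"]

print(cube[("ALL", "widget")])  # 17.0 -- answered without scanning the facts
```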

XML databases deal well with the problem of storing and searching data whose structure may not be known in advance.  Applications around data messages and documents do well with XML database solutions.
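A minimal sketch of that flexibility (the document and tag names are hypothetical): the search walks whatever elements exist rather than requiring a fixed schema.

```python
# Query XML whose structure wasn't known in advance.
import xml.etree.ElementTree as ET

message = """
<order>
  <customer><name>Acme</name></customer>
  <lines><line sku="W-1" qty="3"/><line sku="G-9" qty="1"/></lines>
</order>
"""

root = ET.fromstring(message)

# Find every <line> element wherever it appears in the tree.
for line in root.iter("line"):
    print(line.get("sku"), line.get("qty"))

# Discover the structure instead of assuming it.
print([child.tag for child in root])  # ['customer', 'lines']
```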

Another day I’ll blog about some of the inherent problems relational databases have with speed and volume, but the main point is that organizations are finding it worthwhile to expand their database solutions beyond relational database management systems.


Social Media and Data Management

June 7, 2011

Two years ago my friends convinced me to set up a Twitter account because, as they said, if I consider myself a professional “data person” (can I use that as a title on a business card?), then I should know more about this important data outlet.  My use of Twitter arises from two motivations: to create and manage a public persona for myself, and to understand how to advise on social media data management.

Social media is not just a presentation but a conversation, where you can present an image to the public and also mine the conversation about yourself, your competitors, and your market.  This is, indeed, an important input into Business Intelligence.