Drivers for Managing Data Integration – from Data Conversion to Big Data

April 25, 2013

Data management in an organization is focused on getting data to its data consumers (whether human or application). Whereas the goal of data quality and data governance is trusted data, the goal of data integration is available data – getting data to the data consumers in the format that is right for them.
My new book on Data Integration has been published and is now available: “Managing Data in Motion: Data Integration Best Practice Techniques and Technologies”. Of course the first part of a book on data management techniques has to answer the question of why an organization should invest time and effort and money. The drivers for data integration solutions are very compelling.
Supporting Data Conversion
One very common need for data integration techniques arises when copying or moving data from one application or data store to another, either when replacing an application in a portfolio or seeding the data needed for an additional application implementation. The data must be formatted as appropriate for the new application data store, in both its technical format and its semantic business meaning.
Managing the Complexity of Data Interfaces by Creating Data Hubs – MDM, Data Warehouses & Marts, Hub & Spoke
This, I think, is the most compelling reason for an organization to have an enterprise data integration strategy and architecture: hubs of data significantly simplify the problem of managing the data flowing between the applications in an organization. The number of potential interfaces between applications in an organization grows with the square of the number of applications. Thus, an organization with one thousand applications could have as many as half a million interfaces, if all applications had to talk to all others. By using hubs of data, an organization brings the potential number of interfaces down to just a linear function of the number of applications.
Master Data Management hubs are created to provide a central place for all applications in an organization to get their Master Data. Similarly, Data Warehouses and Data Marts give an organization one place to obtain all the data it needs for reporting and analysis.
Data hubs that are not visible to the human data consumers of the organization can be used to significantly simplify the natural complexity of data interfaces. If data being passed around in the organization is formatted, on leaving the application where it has been updated, into a common data format for that type of data, then applications updating data only need to reformat data into one format, instead of a different format for every application that needs it. Applications that need to receive the updated data only need to reformat it from the one common format into the format they need. This approach to data integration architecture is called a “hub and spoke” approach. The structure of the common data format that all applications pass their data to and from is called the “canonical model.” Applications that want a certain kind of data “subscribe” to that data, and applications that provide a certain kind of data are said to “publish” it.
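The publish and subscribe flow through a canonical model can be sketched in a few lines of Python. This is only a minimal illustration, not a description of any particular product: the `Hub` class, the "customer" data type, and the CRM, billing, and shipping applications with their field names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# The canonical model for "customer" data (hypothetical): two string fields.
Canonical = Dict[str, str]


class Hub:
    """Routes canonical records from publishers to subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Canonical], None]]] = defaultdict(list)

    def subscribe(self, data_type: str, handler: Callable[[Canonical], None]) -> None:
        self._subscribers[data_type].append(handler)

    def publish(self, data_type: str, record: Canonical) -> None:
        for handler in self._subscribers[data_type]:
            handler(record)


# Publisher side: the (hypothetical) CRM reformats its own row into the
# canonical format once, regardless of how many applications consume it.
def crm_to_canonical(row: dict) -> Canonical:
    return {"customer_id": str(row["id"]), "full_name": f'{row["first"]} {row["last"]}'}


# Subscriber side: each target reformats from the canonical format only.
billing_store: List[dict] = []
shipping_store: List[dict] = []


def billing_from_canonical(rec: Canonical) -> None:
    billing_store.append({"cust_no": rec["customer_id"], "name": rec["full_name"].upper()})


def shipping_from_canonical(rec: Canonical) -> None:
    shipping_store.append({"recipient": rec["full_name"], "ref": rec["customer_id"]})


hub = Hub()
hub.subscribe("customer", billing_from_canonical)
hub.subscribe("customer", shipping_from_canonical)
hub.publish("customer", crm_to_canonical({"id": 42, "first": "Ada", "last": "Lovelace"}))
```

Adding a third subscriber here would mean writing one new function that reads the canonical format; the CRM side would not change.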
Integrating Vendor Packages with an Organization’s Application Portfolio
Current best practice is to buy vendor packages rather than developing custom applications, whenever possible. This exacerbates the data integration problem because each of these vendor packages will have its own master data that has to be integrated with the organization’s master data, and each will have to either send or receive transactional data for consolidated reporting and analytics.
Sharing Data Among Applications and Organizations
Some data just naturally needs to flow between applications to support the operational processes of the organization. These days, that flow of data usually needs to be in a real time or near real time mode, and it makes sense to solve the requirements across the enterprise or across the applications that support the supply chain of data rather than developing independent solutions for each application.
Archiving Data
The life cycle for data may not match the life cycle for the application in which it resides. Some data may get in the way if retained in the active operational application and some data may need to be retained after an application is retired, even if the data is not being migrated to another application. All enterprises should have an enterprise archiving solution available where data can be housed and from which it can still be retrieved, even if the application from which it was taken no longer exists.
Moving data out of an application data store and restructuring it for an enterprise archiving solution is an important data integration function.
Leveraging External Available Data
A great deal of data is now available from government and other sites external to a company’s own, some for free and some for a fee. To leverage the value of what is available, the external data needs to be made available, in an appropriate format, to the data consumers who can use it. The external data is now so vast and arrives so fast that it may not be warranted to store or persist it at all, relying instead on data virtualization and streaming data techniques, or it may make sense not to store it within the organization, choosing instead to leverage cloud solutions that are also external.
Integrating Structured and Unstructured Data
New tools and techniques allow analysis of unstructured data such as documents, web sites, social media feeds, audio, and video data. The analysis is most meaningful when it is possible to integrate structured data (found in databases) with the unstructured data types listed above. Data integration techniques and new technologies such as data virtualization servers enable the integration of structured and unstructured data.
Supporting Operational Intelligence and Management Decision Support
Using data integration to leverage big data includes not just mashing different types of data together for analysis, but being able to use data streams with that big data analysis to trigger alerts and even automated actions. Example use cases exist in every industry but some of the ones we’re all aware of include monitoring for credit card fraud as well as recommending products.


Big Data Modeling – part 2 – The Big Data Modeler

July 23, 2012

Continuing my discussion of “Big Data Modeling,” what is it and is it any different from normal data modeling?  Ultimately, the questions come down to: is there a role for a modeler on Big Data projects and what does that role look like?

Modeling for Communication –

If modeling is the process of creating a simpler representation of something that does or might exist, we can use modeling for communicating information about something in a simpler way than presenting the thing itself.  After all, we aren’t limited in describing a computer system to presenting only the system itself, but we present various models to communicate different aspects of what is or what might be.

Modeling Semantics –

On Big Data projects, as with all data oriented projects, it is necessary to communicate logical and semantic concepts about the data involved in the project.  This may involve, but is not limited to, models presented in entity-relationship diagrams.  The data modeling needs, in fact, are not even limited to the design of structures but certainly include data flows, process models, and other kinds of models.  This would also include any necessary taxonomy and ontology models.

Modeling Design –

Prior to construction it is necessary to represent (design) the data structures needed for the persistent as well as transitory data used in the project.  Persistent data structures include those in files or databases.  Transitory data structures include the messages and streams of data passing into and out of the organization as well as between applications.  For data being received from other organizations or groups, this may be a matter of receiving information about the structures rather than designing them. This is, or is close to, the physical design level of the implementation including the design of database tables and structures, file layouts, metadata tags, message layouts, data services, etc.

Modeling Virtual Layers –

There is a big movement in systems development toward virtualizing layers of the infrastructure, where the view presented to programmers or users may be different from the actual physical implementation.  This move toward creating virtual layers that can change independently is true in data design as well. It is necessary to design, or model, the presentation of information to the system’s users (client experience) and programmers independently of the modeling of the physical data structures. This is even more necessary for Big Data because it includes designing levels of virtualization for normalizing or merging data of different types into a consistent format.  In addition to the modeling of the virtual data layers, there is a need to design the translation from the physical data structures to the virtual level, such as between relational database structures and web service objects.
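As a rough illustration of that translation step, the sketch below maps rows in a physical relational table to an object of the kind a service layer might present. The `cust` table, its column names, and the `CustomerView` class are assumptions made up for the example; an in-memory SQLite database stands in for the physical store.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class CustomerView:
    """Virtual-layer object presented to programmers (hypothetical)."""
    customer_id: int
    display_name: str


def load_customer_views(conn: sqlite3.Connection) -> List[CustomerView]:
    # The physical column names stay hidden behind this translation function,
    # so the table design can change without changing CustomerView.
    rows = conn.execute(
        "SELECT cust_id, first_nm, last_nm FROM cust ORDER BY cust_id"
    ).fetchall()
    return [CustomerView(customer_id=r[0], display_name=f"{r[1]} {r[2]}") for r in rows]


# Physical layer: a hypothetical table with abbreviated column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cust_id INTEGER PRIMARY KEY, first_nm TEXT, last_nm TEXT)")
conn.execute("INSERT INTO cust VALUES (1, 'Ada', 'Lovelace')")

views = load_customer_views(conn)
```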

Modeling Mappings and Transformations –

It is necessary in any design that involves the movement of data between systems, whether Big Data or not, to specify the lineage in the flow of data from physical data structure to physical data structure, including the mappings and transformation rules necessary from persistent data structure to message to persistent data structure, as necessary.  This level of design requires an understanding of both the physical implementation and the business meaning of the data. We don’t usually call this activity modeling but strictly design.
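One way to picture such a specification is as a table of field-level mappings, each with a transformation rule, that can both be executed and be read back as lineage. The sketch below assumes a hypothetical order record and message layout; none of the field names come from a real system.

```python
from typing import Callable, Dict, List, NamedTuple


class Mapping(NamedTuple):
    source_field: str                    # field in the persistent source structure
    target_field: str                    # field in the outgoing message
    transform: Callable[[str], object]   # executable transformation rule
    rule: str                            # the same rule stated for human readers


# Hypothetical mapping specification from an order table row to an order message.
ORDER_TO_MESSAGE: List[Mapping] = [
    Mapping("ORD_NO", "order_id", str.strip, "trim whitespace"),
    Mapping("ORD_AMT", "amount_cents", lambda v: round(float(v) * 100), "dollars to integer cents"),
    Mapping("CUST_NM", "customer_name", str.title, "title-case the name"),
]


def apply_mappings(record: Dict[str, str], mappings: List[Mapping]) -> Dict[str, object]:
    """Build the target structure by applying each mapping's transformation."""
    return {m.target_field: m.transform(record[m.source_field]) for m in mappings}


def lineage(mappings: List[Mapping]) -> List[str]:
    """Describe where each target field came from and how it was derived."""
    return [f"{m.source_field} -> {m.target_field} ({m.rule})" for m in mappings]


message = apply_mappings(
    {"ORD_NO": " A17 ", "ORD_AMT": "12.50", "CUST_NM": "ada lovelace"},
    ORDER_TO_MESSAGE,
)
```

Keeping the rule text next to the executable transform is a design choice for the sketch: it lets one specification serve both the person who knows the business meaning and the person who knows the physical implementation.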

Ultimately, there is a lot of work for a data modeler on Big Data projects, although little of it may look like creating entity-relationship models.  There is the need to create models for communicating ideas, for designing physical implementation solutions, for designing levels of virtualization, and for mapping between these models and designs.

Big Data Modeling – part 1 – Defining “Big Data” and “Data Modeling”

July 15, 2012

Last month I participated in a DataVersity webinar on Big Data Modeling.  There are a lot of definitions necessary in that discussion. What is meant by Big Data? What is meant by modeling? Does modeling mean entity-relationship modeling only or something broader?

The term “Big Data” implies an emphasis on high volumes of data. What constitutes big volumes for an organization seems to be dependent on the organization and its history.  The Wikipedia definition of “Big Data” says that an organization’s data is “big” when it can’t be comfortably handled by on hand technology solutions.  Since the current set of relational database software can comfortably handle terabytes of data and even desktop productivity software can comfortably handle gigabytes of data, “big” implies many terabytes at least.

However, the consensus on the definition of “Big Data” seems to be with the Gartner Group definition that says that “Big Data” implies large volume, variety, and velocity of data.  Therefore, “Big Data” means not just data located in relational databases but files, documents, email, web traffic, audio, video, and social media, as well.  The various types of data provide the “variety”, which covers not just data in an organization’s own data center but also data in the cloud, data from external sources, and data on mobile devices.

The third aspect of “Big Data” is the velocity of data.  The ubiquity of sensor and global position monitoring information means a vast amount of information is available at an ever increasing rate from both internal and external sources.  How quickly can this barrage of information be processed?  How much of it needs to be retained and for how long?

What is “data modeling”? Most people seem to picture this activity as synonymous with “entity relationship modeling”.  Is entity relationship modeling useful for purposes outside of relational database design?  If modeling is the process of creating a simpler representation of something that does or might exist, we can use modeling for communicating information about something in a simpler way than presenting the thing itself. So modeling is used for communicating.  Entity relationship modeling is useful to communicate information about the attributes of the data and the types of relationships allowed between the pieces of data.  This seems like it might be useful to communicate ideas outside of just relational databases.

Data modeling is also used to design data structures at various levels of abstraction from conceptual to physical. When we differentiate between modeling and design, we are mostly just differentiating between logical design and design closer to the physical implementation of a database. So data modeling is also useful for design.

In the next part of this blog I’ll get back to the question of “Big Data Modeling.”

The Problem With Point to Point Interfaces

November 21, 2011


The average corporate computing environment comprises hundreds to thousands of disparate and changing computer systems that have been built, purchased, and acquired.  The data from these various systems needs to be integrated for reporting and analysis, shared for business transaction processing, and converted from one system format to another when old systems are replaced and new systems are acquired.  Effectively managing the data passing between systems is a major challenge and concern for every Information Technology organization.


Most Data Management focus is on data stored in structures such as databases and files, with much less focus on the data flowing between and around those data structures.  Yet, because of the prevalence of purchasing rather than building application solutions, the management of the “data in motion” in organizations is rapidly becoming one of the main concerns for business and IT management.  As additional systems are added into an organization’s portfolio, the complexity of the interfaces between the systems grows dramatically, making management of those interfaces overwhelming.


Traditional interface development quickly leads to a level of complexity that is unmanageable.  If there is one interface between every pair of systems in an application portfolio and “n” is the number of applications in the portfolio, then there will be approximately (n-1)² / 2 interface connections (exactly n(n-1)/2).  In practice, not every system needs to interface with every other, but there may be multiple interfaces between systems for different types of data or needs.  For a manager of application systems, this means that managing 101 applications may involve something like 5,000 interfaces.  A portfolio of 1001 applications may present 500,000 interfaces to manage.  There are more manageable approaches to interface development than the traditional “point to point” data integration solutions that generate this type of complexity.


The use of a “hub and spoke” rather than “point to point” approach to interfaces changes the complexity of managing interfaces from quadratic to linear in the number of systems.  The basic idea is to create a central data hub.  Instead of the need to translate from each system to every other system in the portfolio, interfaces only need to translate from the source system to the hub and then from the hub to the target system.  When a new system is added to the portfolio it is only necessary to add translations from the new system to the hub and from the hub back to the new system.  Translations to all the other systems already exist. This architectural technique for interface design makes a substantial difference to the complexity of managing an IT systems portfolio, and yet it has nothing really to do with introducing a new technology.
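The arithmetic behind the comparison is easy to check. The sketch below counts one interface per pair of systems for point to point, and, as a counting assumption of this example, two translations per system (one to the hub and one from it) for hub and spoke. The function names are illustrative only.

```python
def point_to_point(n: int) -> int:
    """Exact number of system pairs: n(n-1)/2."""
    return n * (n - 1) // 2


def approx_point_to_point(n: int) -> int:
    """The rounder approximation (n-1)^2 / 2."""
    return (n - 1) ** 2 // 2


def hub_and_spoke(n: int) -> int:
    """One translation to the hub and one from the hub for each system."""
    return 2 * n


for n in (101, 1001):
    print(n, point_to_point(n), approx_point_to_point(n), hub_and_spoke(n))
# prints:
# 101 5050 5000 202
# 1001 500500 500000 2002
```

Adding one more system to a portfolio of 1001 adds 1001 potential point to point interfaces but only two hub translations.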


How many Data Integration solutions do you need?

August 8, 2011

For a long time I’ve mulled over whether it’s necessary to keep both a batch and real time infrastructure for Data Integration.  After all, if you have a fully functional real time Data Integration solution, then why do you have to operate two sets of software and monitoring? 

This is not to say that it is absolutely necessary to be running only one Data Integration solution in a large organization – as long as the solutions communicate effectively. Still, there are clear benefits to minimizing the number of technologies running to support a particular capability.  For each technology you run you need to have programmers who are experts in the technology and you have to have people monitoring the operations.  This is in addition to the software licenses and hardware necessary for each solution environment.

So, if you have a real time Data Integration solution in operation then why not just use it for your batch needs as well? 

The problem becomes one of volumes and processing time.  The volumes involved in batch oriented data integration needed for a Data Warehouse nightly load, for example, would probably be difficult for a real time solution to handle in a timely fashion. Of course, it depends on the particular organization involved.  And, given enough money and expertise, you can probably get a real time data integration solution to process whatever volumes are necessary in the time slot available. 

Batch Data Integration solutions (i.e. ETL tools) are focused on handling large volumes in a small time slot, so for most organizations, it is worthwhile to have both batch and real time Data Integration solutions in operation.   If you are calculating alternative costs, remember to include the cost of monitoring operations of multiple solutions as well as the cost of having expertise on staff for multiple technologies – and don’t forget the costs of conversion if you’re thinking of eliminating existing technologies.

Data Integration and Data Governance for Cloud Computing

July 18, 2011

I was recently reviewing the Data Integration architecture for a client and they asked me what they should be looking at for Data Integration when they start using Cloud Computing.  The simple, and boring, answer is that you should be able to use the same solutions in the Cloud as you use in a traditional server/database environment.  Your Enterprise Service Bus is specifically meant to be able to integrate across heterogeneous technologies.  Integrating data from a Cloud Computing environment with other data should require only adapters for the specific technologies of the servers where the data is located, whether for Hadoop, other specialized file systems, or specialized database management systems.

The answer for Data Governance of Cloud Computing is even more boring – there aren’t any changes.  Data Governance is about management and processes and is technology independent.  You may need some specialized tools for profiling data and reporting data quality metrics, but the Data Governance process itself doesn’t change between technologies.

Architecting MDM for Reporting versus Real-time Processing

June 16, 2011

In recent discussions with Joseph Dossantos, he pointed out to me that the differences in architecting an MDM solution for Reporting, such as for a Data Warehouse, and for real-time transaction processing, go beyond the choice of batch versus real-time Data Integration.  Obviously, although the use of a batch ETL solution may be appropriate for integrating the source and target systems with a Master Data hub, it is insufficient for update and access to Master Data being used in transaction processing.  For real-time Data Integration it is better to use an Enterprise Service Bus (ESB) and / or Service Oriented Architecture (SOA).

However, there are other differences in the architectural solution for real-time MDM.  The common functions of MDM, such as matching and deduplication, also need to be architected for real-time use.  The response to information requests needs to be instantaneous. Master Data for Reporting flows from source to hub to target to report (see Inmon’s Corporate Information Factory) but for transaction processing, all capabilities must be able to happen in any order or simultaneously.
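To make the real-time matching point concrete, here is a minimal sketch of a match-or-create lookup that resolves a master record at request time instead of in a nightly batch. The exact-key matching rule (normalized name plus postal code) and the `MasterIndex` class are simplifying assumptions for illustration; production MDM matching is usually fuzzier and far more elaborate.

```python
import re
from typing import Dict


def match_key(name: str, postal_code: str) -> str:
    """Normalize the identifying fields so trivial variations match."""
    normalized_name = re.sub(r"[^a-z0-9]", "", name.lower())
    normalized_postal = postal_code.replace(" ", "").upper()
    return f"{normalized_name}|{normalized_postal}"


class MasterIndex:
    """In-memory index so each request is answered with a single lookup."""

    def __init__(self) -> None:
        self._by_key: Dict[str, int] = {}
        self._next_id = 1

    def resolve(self, name: str, postal_code: str) -> int:
        """Return the existing master id for a match, or create a new one."""
        key = match_key(name, postal_code)
        if key not in self._by_key:
            self._by_key[key] = self._next_id
            self._next_id += 1
        return self._by_key[key]


index = MasterIndex()
a = index.resolve("Acme Corp.", "SW1A 1AA")
b = index.resolve("ACME corp", "sw1a1aa")   # same party, different formatting
c = index.resolve("Apex Ltd", "SW1A 1AA")   # different party, same postal code
```

Here `a` and `b` resolve to the same master id while `c` receives a new one, and a transaction could call `resolve` in any order relative to other updates.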

Data Integration is the key to Everything

June 2, 2011

I’ve been thinking a lot recently about Data Integration.  Since most applications at organizations are now purchased packages, it seems that most of the custom development an organization needs to do is around consolidating data into a Data Warehouse or integrating applications together.  Integration isn’t just one of a CIO’s biggest problems, it should be one of their biggest focuses (maybe following only security, business continuity, and core application support).

Today, I was thinking about Data Integration in Cloud Computing, and I realized that Data Integration is the key enabler to Private Cloud / Public Cloud hybrid computing.  In fact, Data Integration is key to any interaction between data stored offsite (“In the Cloud!”) and onsite.

My mind dismisses what it doesn’t want to see

May 8, 2011

Have you heard about the phenomenon in which your mind fills in the pieces of a scene it expects, even if your eyes don’t see them?

I’ve been reading various books on Data Integration lately because I am thinking of proposing my own book on the subject and I was interested in what was already written.  So I was reading a chapter in “Data Integration Blueprint and Modeling” by Anthony David Giordano on various Data Integration architectures and he described a “Federated” architecture as being one where tables in various databases, even on separate servers, are joined together.  Now this is a perfectly acceptable architectural concept but is, in my experience, so incredibly slow that it is not really a viable option. (I should mention that Mr. Giordano does say that it is not suggested for real-time processing.)  I had basically removed the option from my mind until I read his description.  There may very well be times when you don’t want all the duplicate data involved in replication, but since it was not something I ever intended to do, it was gone … gone from my brain.

What do you call it?

May 2, 2011

At Enterprise Data World I gave a workshop which was an introduction to Data Integration. The material seemed to fall into two types: information about technologies and information about … architecture? patterns? design? methodology? For example, a Hub and Spoke architecture is not a technology but a way to design interfaces in an efficient way. For now, I’ll call it architecture or design.