Drivers for Managing Data Integration – from Data Conversion to Big Data

April 25, 2013

Data management in an organization is focused on getting data to its data consumers (whether human or application). Whereas the goal of data quality and data governance is trusted data, the goal of data integration is available data – getting data to the data consumers in the format that is right for them.
My new book on Data Integration has been published and is now available: “Managing Data in Motion: Data Integration Best Practice Techniques and Technologies”. Of course the first part of a book on data management techniques has to answer the question of why an organization should invest time and effort and money. The drivers for data integration solutions are very compelling.
Supporting Data Conversion
One very common need for data integration techniques is when copying or moving data from one application or data store to another, either when replacing an application in a portfolio or seeding the data needed for an additional application implementation. It is necessary to format the data as appropriate for the new application data store, both the technical format and the semantic business meaning of the data.
Managing the Complexity of Data Interfaces by Creating Data Hubs – MDM, Data Warehouses & Marts, Hub & Spoke
This, I think, is the most compelling reason for an organization to have an enterprise data integration strategy and architecture: hubs of data significantly simplify the problem of managing the data flowing between the applications in an organization. The number of potential interfaces between applications in an organization is an exponential function of the number of applications. Thus, an organization with one thousand applications could have as many as half a million interfaces, if all applications had to talk to all others. By using hubs of data, an organization brings the potential number of interfaces down to be just a linear function of the number of applications.
Master Data Management hubs are created to provide a central place for all applications in an organization to get its Master Data. Similarly, Data Warehouses and Data Marts enable an organization to have one place to obtain all the data they need for reporting and analysis.
Data hubs that are not visible to the human data consumers of the organization can be used to significantly simplify the natural complexity of data interfaces. If data being passed around in the organization is formatted, on leaving the application where it has been updated, into a common data format for that type of data, then applications updating data only need to reformat data into one format, instead of a different format for every application that needs it. Applications that need to receive the data that has been updated only need to reformat the data from the one common format into their own needs. This approach to data integration architecture is called using a “hub and spoke” approach. The structure of the common data format that all applications pass their data to and from is called the “canonical model.” Applications that want a certain kind of data need to “subscribe” to that data and applications that provide a certain kind of data are said to “publish” the data.
Integrating Vendor Packages with an Organization’s Application Portfolio
Current best practice is to buy vendor packages rather than developing custom applications, whenever possible. This exacerbates the data integration problem because each of these vendor packages will have their own master data that have to be integrated with the organization’s master data and they will either have to send or receive transactional data for consolidated reporting and analytics.
Sharing Data Among Applications and Organizations
Some data just naturally needs to flow between applications to support the operational processes of the organization. These days, that flow of data usually needs to be in a real time or near real time mode, and it makes sense to solve the requirements across the enterprise or across the applications that support the supply chain of data rather than developing independent solutions for each application.
Archiving Data
The life cycle for data may not match the life cycle for the application in which it resides. Some data may get in the way if retained in the active operational application and some data may need to be retained after an application is retired, even if the data is not being migrated to another application. All enterprises should have an enterprise archiving solution available where data can be housed and from which it can still be retrieved, even if the application from which it was taken no longer exists.
Moving data out of an application data store and restructuring it for an enterprise archiving solution is an important data integration function.
Leveraging External Available Data
There is so much data now available from government and other sites external to a company’s own, for free as well as data available for a fee. In order to leverage the value of what is available the external data needs to be made available to the data consumers who can use it, in an appropriate format. The amount of data now available is so vast and so fast that it may not be warranted to store or persist the external data, rather using techniques with data virtualization and streaming data, or not to store the data within the organization, choosing instead to leverage cloud solutions that are also external.
Integrating Structured and Unstructured Data
New tools and techniques allow analysis of unstructured data such as documents, web sites, social media feeds, audio, and video data. Greatest meaning can be applied to the analysis when it is possible to integrate together structured data (found in databases) and unstructured data types listed above. Data integration techniques and new technologies such as data virtualization servers enable the integration of structured and unstructured data.
Support Operational Intelligence and Management Decision Support
Using data integration to leverage big data includes not just mashing different types of data together for analysis, but being able to use data streams with that big data analysis to trigger alerts and even automated actions. Example use cases exist in every industry but some of the ones we’re all aware of include monitoring for credit card fraud as well as recommending products.


Is High Availability Sexy?

April 10, 2013

The subject of business continuity has grown in appeal for me as my years in IT have grown, especially as I have personally experienced disasters big and small and the need to recover systems and facilities. I became particularly interested during my training as an IT auditor.
The area of business continuity isn’t necessarily a “sexy” part of data management, but it is a franchise requirement for most organizations and corporations and certainly critical for financial services organizations. Interestingly, the responsibility for business continuity is a business responsibility and yet the knowledge and training for how to implement it is a specialty within information technology (IT). I call this “the tail wagging the dog” because the responsibility cannot be delegated to IT and yet IT needs to lead the process of how to implement it.
The way we implement business continuity is using techniques in disaster recovery and high availability. Disaster recovery is how to bring back up systems and access after the loss of power, services, or access to a facility. High availability is a similar concept except maintaining system continuity by switching to alternative resources automatically at the loss of any resource, system, connection, facility, etc.
The rule of thumb with business continuity is that the lower the amount of time of any disruption at the loss of a resource, the higher the cost. Thus, a high availability solution that has no disruption has the highest cost. Organizations that require high availability solutions are therefore frequently spending millions of dollars on their disaster recovery solutions and millions more on their high availability solutions.
EMC has recently released a new high availability services product and is now asking the question “why invest in both disaster recovery and high availability?” http://www.emc.com/about/news/press/2013/20130212-01.htm
Maybe organizations that require high availability should put their business continuity budgets into that rather than dividing between both high availability and disaster recovery. Well, it may not be such a simplistic answer. Should every single application in the organization be set up with high availability? And yet, dividing systems between continuity solutions makes testing and effecting business continuity much more difficult. As long as the organization can prove they have a high availability solution for everything that would serve any necessary disaster recovery requirements.
OK. High availability isn’t sexy. But to me, it is slightly sexier than disaster recovery. Certainly it is time to consider whether it is more cost effective to put the entire business continuity budget into high availability.


Data Virtualization – Part 2 – Data Caching

June 10, 2012

Another type of Data Virtualization that is less frequently discussed than Business Intelligence (see my previous blog) has to do with having data available in the computer’s memory, or as close to it as possible, in order to tremendously speed up the speed of both data access and update.

A simplistic way of thinking about the relative time to retrieve data is that if it takes a certain amount of time in nanoseconds to retrieve something in memory then it will be something like 1000 times that to retrieve data from disk (milliseconds). Depending on the infrastructure configuration, retrieving data over a LAN or from the internet may be ten to 1000 times slower than that. If I load my most heavily used data into memory in advance, or something that behaves like memory, then my processing of that data should be speeded up by multiple orders of magnitude.  Using solid state disk for heavily used data can achieve access and update response times similar to having data in memory.  Computer memory, as well as solid state drives, although not as cheap as traditional disk, are certainly substantially less expensive than they used to be.

Using memory caching of data can be done using traditional databases and sequential processing, and multiple orders of magnitude response time improvements can be achieved.  However, really spectacular performance is possible if we combine memory caching with parallel computing and databases designed around data caching, such as GemFire.  This does require that we develop systems using these new technologies and approaches in order to take advantage of parallel processing and optimized data caching, but the results can be blazingly fast.


The Problem With Point to Point Interfaces

November 21, 2011

 

The average corporate computing environment is comprised of hundreds to thousands of disparate and changing computer systems that have been built, purchased, and acquired.  The data from these various systems needs to be  integrated for reporting and analysis, shared for business transaction processing, and converted from one system format to another when old systems are replaced and new systems are acquired.  Effectively managing the data passing between systems is a major challenge and concern for every Information Technology organization.

 

Most Data Management focus is around data stored in structures such as databases and files, and a much smaller focus on the data flowing between and around the data structures.  Yet, because of the prevalence of purchasing rather than building application solutions, the management of the “data in motion” in organizations is rapidly becoming one of the main concerns for business and IT management.  As additional systems are added into an organization’s portfolio the complexity of the interfaces between the systems grows dramatically, making management of those interfaces overwhelming.

 

Traditional interface development quickly leads to a level of complexity that is unmanageable.  If there is one interface between every system in an application portfolio and “n” is the number of applications in the portfolio, then there will be approximately (n-1)2 / 2 interface connections.  In practice, not every system needs to interface with every other, but there may be multiple interfaces between systems for different types of data or needs.  This means for a manager of application systems that if they are managing 101 applications then there may be something like 5,000 interfaces.  A portfolio of 1001 applications may provide 500,000 interfaces to manage.  There are more manageable approaches to interface development than the traditional “point to point” data integration solutions that generate this type of complexity.

 

The use of a “hub and spoke” rather than “point to point” approach to interfaces changes the level of complexity of managing interfaces from exponential to linear.  The basic idea is to create a central data hub.  Instead of the need to translate from each system to every other system in the portfolio, interfaces only need to translate from the source system to the hub and then from the hub to the target system.  When a new system is added to the portfolio it is only necessary to add translations from the new system to the hub and from the hub back to the new system.  Translations to all the other systems already exist. This architectural technique to interface design makes a substantial difference to the complexity of managing an IT systems portfolio, and yet it had nothing really to do with introducing a new technology.

 


Show Me the Money! – Monetizing Data Management

October 23, 2011

On November 10 Dr. Peter Aiken will be coming to Northern NJ to speak at the DAMA NJ meeting about “Monetizing Data Management” – understanding the value of Data Management initiatives to an organization and the cost of not making Data Management investment. http://www.dama-nj.com/

At Data Management conferences I’ve been to over the last few years people are still more likely to attend sessions on improving data modeling techniques than on valuing data assets or data management investment, or the cost of poor quality data.  Maybe it’s just the nature of the conferences I attend, more technical than business oriented.  But I think every information technology professional should be prepared to explain, when asked or given the opportunity, why these investments are imperative.

If you can make it to Berkeley Heights on November 10, I hope you will attend Dr. Aiken’s presentation, which you will find to be a great use of your time.

 

 


When Technology Leads – The Tail Wagging the Dog

September 27, 2011

There is a great temptation to implement technology because it is “cool”, but for decades business and technology strategists (as well as most people in both business and technology) have realized that unless your business is to sell technology, the implementation of technology should be in support of business goals.  Sometimes, technology innovations can provide entirely new ways of performing business services and allow business differentiation.  In fact, there is a movement toward technology strategy being developed in collaboration with business strategy, rather than subsequently.

There are also some business functions that must be performed by every organization that are critical to business operation where, in practice, the technology organization tends to lead. One such area is “Business Continuity”, preparing for emergencies and business disruptions.  This is a business responsibility which cannot be simply delegated to the technology organization, and yet it requires significant specialized expertise, and in practice tends to be developed mostly by highly trained technologists.  The part of Business Continuity that deals with the recovery of data and computer systems is called “Disaster Recovery” and is a core technology operations capability.  So, the technology organizations tend to provide most of the resources to help business areas develop, test, and implement Business Continuity plans.  In practice, the tail wags the dog.

Best practice holds that Data Governance and Data Quality programs should be led by business managers, not IT, but there are key aspects of these programs which cannot be accomplished readily without technology support.  The key skills involved in performing these functions involve process improvement and data analysis capabilities, which are skills found most frequently in technology organizations.  Frequently, Data Governance and Data Quality initiatives get started in IT, but tend to be much more successful when led from business areas.