Data Virtualization – Part 1 – Business Intelligence

March 26, 2012

The big transformation that we’re all dealing with in technology today is virtualization.  There are many aspects to virtualization: infrastructure, systems, organization, office, applications. When you search on the internet for “data virtualization,” most of the references are regarding business intelligence and data warehousing uses.  In part 2 of this blog I will talk about data virtualization and transaction processing.

Back in the day, when I was building data warehouses (1990s onward), there was a concept called “federated data warehouses”, where the data in the logical warehouse was physically separated, either as multiple instances with the same schema or as different types of data in different locations.  The thought was that the data would be physically separate but brought together in real time for reporting.  We also used to call those “data warehouses that don’t work”.  After all, the reason we created data warehouses in the first place was that we needed to instantiate the consolidated data in order to get reasonable response times when reporting on millions of records. No, really, the response time on these “federated data warehouse” systems used to be many minutes or more.

Now, however, the technologies involved have made huge leaps in capability.  The vendors have put thousands and thousands of man hours into making real time integration and reporting work.  The techniques that enable these capabilities include specialized software and hardware (data appliances), query optimization, distributed processing, and hybrid solutions that sit between pure virtualization and instantiation.  Specialized tuning is still necessary, and the fastest solutions still involve instantiating the consolidated data in a central place.

Ultimately, having to do a project to incorporate new data into the data warehouse physically isn’t responsive enough to the business need for information.  Better to have a short term solution that allows for the quick incorporation of new data and then, if there is a continued need for the data in question and you want to speed up the response, possibly integrate the additional data into the physical data warehouse.

The problems being solved now for business intelligence with data virtualization include real time data integration of multiple regional instances of a data warehouse, integrating data of different types and kinds, and integrating data from a data warehouse with big data and cloud data.  This enables much more responsive business intelligence and analytical solutions to business requests without always having to instantiate all data for analysis in a central, single, enterprise data warehouse.



What is different about Big Data Governance?

December 21, 2011

In most ways, Data Governance of Big Data is not different from normal Data Governance.  The benefits are the same.  The reasons for doing it are the same.  And, mostly, what needs to be done is the same.

What is different about Big Data Governance is that it covers more data types, requires more sophisticated tools, and makes metadata even more critical.

First of all, Big Data Governance requires performing Governance over many different types of data, not just what’s in relational databases.  Certainly, the scope needs to include non-relational databases and unstructured data and documents.  This itself may require new tools to deal with these other technologies.

Secondly (and maybe this should be first because it is about data volumes), more sophisticated tools are needed to assess and profile data.  Big Data volumes are beyond humanly manageable scale, and the traditional approaches of profiling and managing data primarily through observation become infeasible.

Thirdly, collecting and documenting metadata becomes critical in order to automate as many of the Data Governance activities as possible.  This item is tied to the one above, in that more sophisticated tools can help to infer the metadata describing the relationships between the data, and metadata is required to automate the monitoring activities.

In summary, the strategic reasons for doing Data Governance remain the same, as does the way the governance organization is structured, but how the Data Governance of Big Data is actually performed may be very different.


Data Governance Certification or Data Stewardship Certification?

December 6, 2011

The Data Management Association (DAMA) is now offering a Data Governance Certification as an option within their current Certified Data Management Professional certification. This is a natural extension, since the test for Data Governance already existed under the current certification process and the new option merely requires a specific configuration of test modules. But what does Data Governance certification mean, and is that really what is needed?

The Data Governance certification offered by DAMA is, to a great extent, based on the Data Governance practice area described in the DAMA Data Management Body of Knowledge document (DMBOK), which was published in 2009. The DMBOK focuses on the best practices for a Data Governance program and organization in terms of what activities it should be performing, what tools it should be using, and what roles and responsibilities should be present.

But do we need to be certifying that people know how to set up a Data Governance program? Rather, should we be focusing on the people who actually need to perform Data Governance for an organization: the Data Stewards? Certifying Data Stewards may not be something that should be done generically. Rather, an organization may want to certify that its identified Data Stewards are knowledgeable in the agreed standard operating procedures for Data Stewards in that particular organization.

In summary, a Data Governance certification makes sense as a way to identify individuals who are familiar with how, in general, a Data Governance organization should be created and operated. It makes even more sense for an organization to certify its Data Stewards on the particular processes unique to that organization.


What I Learned About Data Management in Japan

December 6, 2011

I just returned from speaking at the DAMA Japan conference. A couple of the officers saw a presentation I gave at the DAMA International conference last April on performing Data Management maturity capability assessments, and they asked me to repeat the speech at their conference in Tokyo.  The Data Management Association (DAMA, dama.org) created a Body of Knowledge document in 2009 (DMBOK) that was translated into Japanese in 2010.  This conference had a lot of focus on the DMBOK, and so they were very interested in how I had been using it to assess Data Management maturity for our client organizations.  About 90 people attended my presentation. The people at the conference and the DAMA Japan organization were very interested in measuring how well their organizations were doing with Data Management, especially compared with how other organizations are doing.

One thing I learned during the conference was how they were applying the “three R’s” concept (“reuse, recycle, reduce”) to Data Management.  It is generally agreed that 80% or more of an organization’s data is ROT: redundant, out-dated, or trivial.  More of our planning time should be spent trying to reuse or remove data structures than on trying to create new ones.  They had also added a fourth “R” for “respect”, a very Japanese consideration and always worth attention.

Another thing that I learned is that some of my Japanese colleagues will take a day to have a brainstorming session on different IT and strategy concepts, and that these sessions might very well happen in a hot spring or bath house.  They had recently held such a session on Enterprise Architecture and decided that it was critical that Enterprise Architecture include business innovation, employee motivation, and technology innovation.  This summary was shared with me after we had all consumed a great deal of sake and other alcohol and they were very willing to try to share their ideas in English.

The people I met at the Data Management conference seemed not very interested in Data Governance but, as I said, extremely interested in assessing Data Management practices.  And it appears that adding a banquet or hot bath makes the discussion of data management and strategy even more insightful.


The Problem With Point to Point Interfaces

November 21, 2011

 

The average corporate computing environment comprises hundreds to thousands of disparate and changing computer systems that have been built, purchased, and acquired.  The data from these various systems needs to be integrated for reporting and analysis, shared for business transaction processing, and converted from one system format to another when old systems are replaced and new systems are acquired.  Effectively managing the data passing between systems is a major challenge and concern for every Information Technology organization.

 

Most Data Management attention goes to data stored in structures such as databases and files, with much less attention paid to the data flowing between and around those structures.  Yet, because of the prevalence of purchasing rather than building application solutions, the management of the “data in motion” in organizations is rapidly becoming one of the main concerns for business and IT management.  As additional systems are added to an organization’s portfolio, the complexity of the interfaces between the systems grows dramatically, making management of those interfaces overwhelming.

 

Traditional interface development quickly leads to a level of complexity that is unmanageable.  If there is one interface between every pair of systems in an application portfolio and “n” is the number of applications in the portfolio, then there will be n(n-1)/2, or approximately (n-1)^2 / 2, interface connections.  In practice, not every system needs to interface with every other, but there may be multiple interfaces between systems for different types of data or needs.  This means that someone managing 101 applications may have something like 5,000 interfaces to manage, and a portfolio of 1,001 applications may produce something like 500,000.  There are more manageable approaches to interface development than the traditional “point to point” data integration solutions that generate this type of complexity.
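To make the arithmetic concrete, here is a minimal sketch that compares the exact number of system pairs with the rounder (n-1)^2 / 2 approximation.  The function names are made up for illustration, and the sketch assumes exactly one interface per pair of systems.

```python
def point_to_point_interfaces(n: int) -> int:
    """Exact number of distinct pairs among n systems (one interface per pair)."""
    return n * (n - 1) // 2


def approx_interfaces(n: int) -> float:
    """The rounder (n-1)^2 / 2 approximation."""
    return (n - 1) ** 2 / 2


# Compare the two figures for a few portfolio sizes.
for n in (11, 101, 1001):
    print(n, point_to_point_interfaces(n), approx_interfaces(n))
```

Both figures grow with the square of the portfolio size, which is the point: ten times the applications means roughly a hundred times the interfaces.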

 

The use of a “hub and spoke” rather than “point to point” approach to interfaces changes the growth in complexity of managing interfaces from quadratic to linear.  The basic idea is to create a central data hub.  Instead of the need to translate from each system to every other system in the portfolio, interfaces only need to translate from the source system to the hub and then from the hub to the target system.  When a new system is added to the portfolio it is only necessary to add translations from the new system to the hub and from the hub back to the new system.  Translations to all the other systems already exist. This architectural approach to interface design makes a substantial difference to the complexity of managing an IT systems portfolio, and yet it has nothing really to do with introducing a new technology.
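The hub idea can be sketched in a few lines.  This is only an illustration under stated assumptions: the DataHub class, the rename helper, the system names ("crm", "billing") and the field names are all hypothetical, and a real hub would handle far more than renaming fields.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Translator = Callable[[Record], Record]


class DataHub:
    """Hypothetical hub: each system registers one translator in each direction."""

    def __init__(self) -> None:
        self._to_hub: Dict[str, Translator] = {}
        self._from_hub: Dict[str, Translator] = {}

    def register(self, system: str, to_hub: Translator, from_hub: Translator) -> None:
        # Adding a system costs two translators, however many systems already exist.
        self._to_hub[system] = to_hub
        self._from_hub[system] = from_hub

    def send(self, source: str, target: str, record: Record) -> Record:
        # Source format -> hub (canonical) format -> target format.
        canonical = self._to_hub[source](record)
        return self._from_hub[target](canonical)

    def translator_count(self) -> int:
        return len(self._to_hub) + len(self._from_hub)


def rename(mapping: Dict[str, str]) -> Translator:
    """Build a translator that renames fields according to mapping."""
    return lambda record: {mapping.get(k, k): v for k, v in record.items()}


hub = DataHub()
hub.register("crm", rename({"cust_nm": "customer_name"}), rename({"customer_name": "cust_nm"}))
hub.register("billing", rename({"NAME": "customer_name"}), rename({"customer_name": "NAME"}))

# The CRM record reaches billing without any CRM-to-billing translator existing.
print(hub.send("crm", "billing", {"cust_nm": "Acme"}))
print(hub.translator_count())
```

With n systems the hub holds 2n translators, whereas a translator for each direction of every pair would need n(n-1).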

 


If the Data Quality got better but no one measured …

November 2, 2011

There is an old philosophical question: “If a tree fell in the forest but no one heard it, did it make a noise?”  The basis of the question being that every time we’ve seen a tree fall in the past it has made a noise, but if no one heard it fall then maybe this one time it didn’t … but you couldn’t prove it either way.

Metrics and measures are central to certain areas of Data Management such as Data Governance, Master Data Management, and especially Data Quality.  You can’t demonstrate that the quality of data improved unless you measure it.  You can’t report the benefit of your program unless you measure it.  And showing improvement means that you need to measure both before and after in order to calculate the improvement.
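As a small illustration of measuring before and after, the sketch below computes one simple Data Quality measure, completeness of a field, on two snapshots of the same records.  The field name, the sample records, and the choice of completeness as the metric are assumptions made for the example; a real program would track several measures.

```python
from typing import Iterable, Mapping, Optional


def completeness(records: Iterable[Mapping[str, Optional[str]]], field: str) -> float:
    """Fraction of records where the field is present and non-blank."""
    records = list(records)
    if not records:
        return 0.0
    filled = sum(1 for r in records if (r.get(field) or "").strip())
    return filled / len(records)


# Hypothetical snapshots taken before and after a cleanup effort.
before = [{"email": "a@example.com"}, {"email": ""}, {"email": None}, {"email": "d@example.com"}]
after = [{"email": "a@example.com"}, {"email": "b@example.com"}, {"email": None}, {"email": "d@example.com"}]

baseline = completeness(before, "email")
current = completeness(after, "email")
print(f"before={baseline:.0%} after={current:.0%} improvement={current - baseline:+.0%}")
```

Without the baseline figure there is nothing to subtract from, which is why the first measurement has to happen before the improvement work starts.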

Senior executives in organizations want to know what value a technology investment brings them.  And the ways to show value are increased revenue, lowered cost, and reduced risk (which can include regulatory compliance). Without reporting financial benefit to management few organizations are willing to support ongoing improvement projects for multiple years.  Also, it is important to report both what the financial benefit has been and what additional opportunities remain – management is very happy to declare success and terminate the program unless you are also reporting what remains to be done.


Show Me the Money! – Monetizing Data Management

October 23, 2011

On November 10 Dr. Peter Aiken will be coming to Northern NJ to speak at the DAMA NJ meeting about “Monetizing Data Management” – understanding the value of Data Management initiatives to an organization and the cost of not making Data Management investment. http://www.dama-nj.com/

At Data Management conferences I’ve been to over the last few years people are still more likely to attend sessions on improving data modeling techniques than on valuing data assets or data management investment, or the cost of poor quality data.  Maybe it’s just the nature of the conferences I attend, more technical than business oriented.  But I think every information technology professional should be prepared to explain, when asked or given the opportunity, why these investments are imperative.

If you can make it to Berkeley Heights on November 10, I hope you will attend Dr. Aiken’s presentation, which you will find to be a great use of your time.

 

 


When Technology Leads – The Tail Wagging the Dog

September 27, 2011

There is a great temptation to implement technology because it is “cool”, but for decades business and technology strategists (as well as most people in both business and technology) have realized that unless your business is to sell technology, the implementation of technology should be in support of business goals.  Sometimes, technology innovations can provide entirely new ways of performing business services and allow business differentiation.  In fact, there is a movement toward technology strategy being developed in collaboration with business strategy, rather than subsequently.

There are also some business functions that must be performed by every organization that are critical to business operation where, in practice, the technology organization tends to lead. One such area is “Business Continuity”, preparing for emergencies and business disruptions.  This is a business responsibility which cannot be simply delegated to the technology organization, and yet it requires significant specialized expertise, and in practice tends to be developed mostly by highly trained technologists.  The part of Business Continuity that deals with the recovery of data and computer systems is called “Disaster Recovery” and is a core technology operations capability.  So, the technology organizations tend to provide most of the resources to help business areas develop, test, and implement Business Continuity plans.  In practice, the tail wags the dog.

Best practice holds that Data Governance and Data Quality programs should be led by business managers, not IT, but there are key aspects of these programs which cannot be accomplished readily without technology support.  The key skills involved in performing these functions involve process improvement and data analysis capabilities, which are skills found most frequently in technology organizations.  Frequently, Data Governance and Data Quality initiatives get started in IT, but tend to be much more successful when led from business areas.


How Useful are NoSQL Products?

September 7, 2011

In August I attended the NoSQL Conference in San Jose, California. This conference was about products and solutions that, primarily, don’t use relational databases.  The recent rise of interest comes from the Big Data space and includes areas that aren’t necessarily too big for relational databases, but that just don’t lend themselves to relational database solutions. Relational databases became the ubiquitous storage solution for data in application systems over the course of the 1990s.  However, I was a working programmer for more than 10 years before that, and so I’ve worked with hierarchical databases and indexed file solutions, among other things.  In the late 1990s I had some very good experiences working with multidimensional databases for data marts, which are also not relational.  One of the keynotes for the NoSQL conference was from Robin Bloor on all the terrible things about relational databases and how it could be done better.  Every sentence out of his mouth was controversial and thought provoking.

The main question in my mind during the conference was “how are these NoSQL technologies and products useful to my customers?  For a large organization with a well established data management portfolio that is based on relational databases, what business problems would be better solved with something else?”

The Big Data technology movement was started around the Hadoop file system and MapReduce applications to solve problems of searching web data.  This kind of technology is used by many web companies (Google, Yahoo, Amazon) and social media companies to manage and search vast amounts of data across multiple servers.  It introduces solutions for cheaply storing and searching unstructured data distributed across many servers.  How might Hadoop and MapReduce be of interest to traditional Data Management organizations?  They introduce search of unstructured data, distributed processing, and possibly Cloud technology (if the distributed servers are in the Cloud).  This gets into the idea of being able to search through vast amounts of organizational data that might previously have seemed too trivial to spend money to store or too expensive to search.  There are quite a few Business Intelligence solutions that don’t use relational database technology or which combine it with other databases and technologies.  The most interesting aspect from a technology perspective is the move to distributed processing engines.
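For readers who haven’t seen the MapReduce pattern, here is a toy, single-process sketch of its three steps (map, shuffle, reduce) counting words across documents.  It is not Hadoop code and uses no Hadoop API; the function names and sample documents are made up, and in a real cluster the map and reduce steps would run in parallel on many servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the grouped values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data big servers", "data in motion"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be spread across as many servers as there are documents or keys.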

A problem area that doesn’t seem to lend itself to relational databases is, ironically, understanding how two people or things are related to one another.  Examples of this problem include analyzing the degrees of separation of two participants in a social network or understanding the relationship between a suspected terrorist and someone who calls them on the telephone. These types of problems are better solved using a graph or node database with a recursive analytical engine. By the way, when was the last time you wrote a recursive program?  Better get out your Knuth algorithms book.
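As a rough illustration of the degrees-of-separation problem, the sketch below walks a small in-memory graph.  Note that it uses an iterative breadth-first search with a queue rather than a recursive routine, since that finds the shortest chain directly; the contact names and the adjacency-dictionary representation are assumptions for the example, not how any particular graph database works.

```python
from collections import deque
from typing import Dict, Optional, Set


def degrees_of_separation(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Length of the shortest chain of contacts from start to goal, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbour in graph.get(person, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # No chain of contacts connects the two.


# Hypothetical contact graph: who has called whom.
contacts = {
    "ann": {"bob"},
    "bob": {"ann", "cho"},
    "cho": {"bob", "dev"},
    "dev": {"cho"},
    "eve": set(),
}

print(degrees_of_separation(contacts, "ann", "dev"))
print(degrees_of_separation(contacts, "ann", "eve"))
```

Expressing the same question in SQL means joining a table to itself once per degree, which is why this kind of problem sits awkwardly in a relational database.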

Multidimensional databases pre-calculate all (or most) summaries along multiple hierarchies or taxonomies such as customer, product, organizational structure, or accounting bucket.  They are blindingly FAST for responding to queries, but they are not dynamic: they require a step after the data is loaded in which all the calculations are done.  Applications for these solutions are data mart cubes.  My experience with them has been particularly good in supporting the analytical needs of Finance organizations.  The capabilities of these databases can be mimicked in a relational database using Kimball’s dimensional modeling and summary tables.
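The pre-calculation idea can be shown with a toy two-dimension cube.  This is a sketch only: the fact rows, the "*" marker for "all members", and the build_cube function are invented for the example, and real multidimensional products also roll up along hierarchies within each dimension.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

ALL = "*"  # Marker meaning "summed over every member of this dimension".


def build_cube(facts: List[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Pre-calculate totals for every (customer, product) combination, including ALL."""
    cube: Dict[Tuple[str, str], float] = defaultdict(float)
    for customer, prod, amount in facts:
        # Each fact contributes to its own cell and to every summary cell above it.
        for key in product((customer, ALL), (prod, ALL)):
            cube[key] += amount
    return dict(cube)


facts = [("acme", "widget", 100.0), ("acme", "gadget", 50.0), ("zenith", "widget", 25.0)]
cube = build_cube(facts)  # The slow step, done once after the load.

# Queries are now simple lookups, with no scanning or summing at query time.
print(cube[("acme", ALL)])
print(cube[(ALL, "widget")])
print(cube[(ALL, ALL)])
```

The trade-off described above is visible here: lookups are immediate, but any new fact row means the cube has to be rebuilt or updated before the totals are right again.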

XML databases deal well with the problem of storing and searching data where the structure of the data may be unknown in advance.  Applications around data messages and documents do well using XML database solutions.
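To give a feel for searching data whose structure isn’t known in advance, the sketch below finds every email element in a set of messages regardless of where it is nested.  It uses Python’s standard in-memory XML parser rather than an XML database, and the sample messages and element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical messages with different shapes; only the element name is shared.
messages = """
<messages>
  <order id="1"><customer><email>a@example.com</email></customer></order>
  <note><to><contact><email>b@example.com</email></contact></to></note>
</messages>
""".strip()

root = ET.fromstring(messages)

# iter() walks the whole tree, so the search needs no knowledge of the nesting.
emails = [element.text for element in root.iter("email")]
print(emails)
```

A relational design would need a table and column decided up front for each place an email address might appear; here the query is written against the data as it arrives.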

Another day I’ll blog about some of the inherent problems of relational databases with speed and volumes, but the main point is that organizations are finding it worthwhile to expand their database solutions beyond just relational database management systems.