What is different about Big Data Governance?

December 21, 2011

In most ways, Data Governance of Big Data is not different from normal Data Governance.  The benefits are the same.  The reasons for doing it are the same.  And, mostly, what needs to be done is the same.

What is different about Big Data Governance is that it covers more data types, requires more sophisticated tools, and makes the collection of metadata critical.

First of all, Big Data Governance requires performing Governance over many different types of data, not just what’s in relational databases.  Certainly, the scope needs to include non-relational databases and unstructured data and documents.  This itself may require new tools to deal with these other technologies.

Secondly (and maybe this should be first because it is about data volumes), more sophisticated tools are needed to assess and profile data.  Big Data volumes are beyond a humanly manageable scale, and the traditional approaches of profiling and managing data primarily through observation become infeasible.
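To make the idea of tool-based profiling concrete, here is a minimal sketch in Python of the kind of summary a profiling tool computes for every column without anyone having to look at the rows. The function names and the sample records are hypothetical, and a real profiling product would do far more than this (patterns, ranges, cross-column rules).

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: row count, nulls, distinct values, most common value."""
    # Treat None and empty strings as missing values (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample records, for illustration only.
sample = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": ""},
    {"customer_id": 3, "country": "US"},
]
report = profile_table(sample)
```

The same function runs unchanged over three rows or three million, which is the point: the profile, not the raw data, is what a person reviews.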

Thirdly, collecting and documenting metadata becomes critical in order to automate as many of the Data Governance activities as possible.  This item is tied to the one above, in that more sophisticated tools can help to infer the metadata about the relationships between the data, and metadata is required to automate the monitoring activities.

In summary, the strategic reasons for doing Data Governance remain the same, as does the way a Big Data organization is structured, but how the Data Governance of Big Data is actually performed may be very different.

How Useful are NoSQL Products?

September 7, 2011

In August I attended the NoSQL Conference in San Jose, California. This conference was about products and solutions that, primarily, don’t use relational databases.  The recent rise of interest comes from the Big Data space and includes areas that aren’t necessarily too big for relational databases, but that just don’t lend themselves to relational database solutions. Relational databases became the ubiquitous storage solution for data in application systems over the course of the 1990s.  However, I was a working programmer for more than 10 years before that, so I’ve worked with hierarchical databases and indexed file solutions, among other things.  In the late 1990s I had some very good experiences working with multidimensional databases for data marts, which are also not relational.  One of the keynotes for the NoSQL conference was from Robin Bloor on all the terrible things about relational databases and how it could be done better.  Every sentence out of his mouth was controversial and thought-provoking.

The main question in my mind during the conference was “how are these NoSQL technologies and products useful to my customers?  For a large organization with a well established data management portfolio that is based on relational databases, what business problems would be better solved with something else?”

The Big Data technology movement was started around the Hadoop file system and Map Reduce applications to solve problems of searching web data.  This technology solution is used by many web (Google, Yahoo, Amazon) and social media companies to manage and search vast amounts of data across multiple servers.  It introduces solutions for storing and searching unstructured data cheaply distributed across many servers.  How might Hadoop and Map Reduce be of interest to traditional Data Management organizations?  It introduces search of unstructured data, distributed processing, and possibly Cloud technology (if the distributed servers are in the Cloud).  This gets into the idea of being able to search through vast amounts of organizational data that might previously have seemed too trivial to spend money to store or too expensive to search.  There are quite a few Business Intelligence solutions that don’t use relational database technology or which combine it with other databases and technologies.  The most interesting aspect from a technology perspective is the move to distributed processing engines.
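For readers who have not seen the Map Reduce pattern, the following is a single-process sketch in Python of its three steps (map, shuffle, reduce) applied to a word count. It is not Hadoop code and the function names are my own; in a real cluster the map and reduce calls would run in parallel on many servers, with the framework handling the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input documents, for illustration only.
counts = word_count(["big data", "big governance"])
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across as many servers as there are documents or keys.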

A problem area that doesn’t seem to lend itself to relational databases is, ironically, understanding how two people or things are related to one another.  Examples of this problem include analyzing the degrees of separation of two participants in a social network or understanding the relationship between a suspected terrorist and someone who calls them on the telephone. These types of problems are better solved using a graph or node database with a recursive analytical engine. By the way, when was the last time you wrote a recursive program?  Better get out your Knuth algorithms book.
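As an illustration of the degrees-of-separation problem, here is a small Python sketch that walks a graph held as an adjacency list. It uses an iterative breadth-first search rather than recursion, since that finds the shortest chain of connections directly; the names and the sample network are hypothetical, and a graph database would provide this kind of traversal as a built-in query.

```python
from collections import deque


def degrees_of_separation(graph, start, target):
    """Return the fewest hops between two people, or None if they are unconnected."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor == target:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical social network: each person maps to the people they know.
network = {
    "ann": ["bob"],
    "bob": ["ann", "cara"],
    "cara": ["bob", "dev"],
    "dev": ["cara"],
    "eve": [],
}
hops = degrees_of_separation(network, "ann", "dev")
```

Expressing the same question in SQL typically means a self-join per hop or a recursive query, which is why this class of problem sits awkwardly in a relational database.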

Multidimensional databases pre-calculate all (or most) summaries along multiple hierarchies or taxonomies such as customer, product, organizational structure, or accounting bucket.  They are blindingly FAST for responding to queries, but they are not dynamic and require a step in which all the calculations are done after the data is loaded.  Applications for these solutions are data mart cubes.  My experience has been particularly good supporting the analytical needs of the Finance organization.  The capabilities of these databases can be mimicked in a relational database using Kimball’s dimensional modeling and summary tables.
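The pre-calculation trade-off can be sketched in a few lines of Python. This toy version totals one measure for every combination of two flat dimensions at load time, so that a query is a single dictionary lookup. The function names and sample facts are hypothetical, and real multidimensional products also roll up along the levels of each hierarchy, which this sketch leaves out.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(facts, dimensions, measure):
    """Pre-calculate the measure total for every combination of dimension values."""
    ordered = sorted(dimensions)
    cube = defaultdict(float)
    for fact in facts:
        # Include the empty combination, which holds the grand total.
        for size in range(len(ordered) + 1):
            for dims in combinations(ordered, size):
                key = tuple((dim, fact[dim]) for dim in dims)
                cube[key] += fact[measure]
    return dict(cube)


def lookup(cube, **filters):
    """Answer a query from the pre-calculated totals without touching the facts."""
    return cube.get(tuple(sorted(filters.items())), 0.0)


# Hypothetical sales facts, for illustration only.
facts = [
    {"customer": "Acme", "product": "widget", "amount": 100.0},
    {"customer": "Acme", "product": "gadget", "amount": 50.0},
    {"customer": "Zenith", "product": "widget", "amount": 25.0},
]
cube = build_cube(facts, ["customer", "product"], "amount")
acme_total = lookup(cube, customer="Acme")
```

All of the cost is paid in build_cube, which has to be rerun when new data arrives; that is the "not dynamic" part of the bargain.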

XML databases deal well with the problem of storing and searching data where the structure of the data may be unknown in advance.  Applications around data messages and documents do well using XML database solutions.
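A small Python sketch shows what searching without a known structure looks like. It uses the standard library's xml.etree.ElementTree as a stand-in for an XML database, and the function name and sample messages are hypothetical. The search asks for an element name wherever it occurs, so two messages with different layouts can be queried the same way.

```python
import xml.etree.ElementTree as ET


def find_values(xml_text, tag):
    """Return the text of every element named `tag`, at any depth, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(tag)]


# Two hypothetical messages whose structures differ.
order_message = "<order><customer><name>Acme</name></customer></order>"
shipment_message = (
    "<msg><name>Zenith</name>"
    "<items><item><name>widget</name></item></items></msg>"
)

order_names = find_values(order_message, "name")
shipment_names = find_values(shipment_message, "name")
```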

Another day I’ll blog about some of the inherent problems of relational databases with speed and volumes, but the main point is that organizations are finding it worthwhile to expand their database solutions beyond just relational database management systems.


Data Integration and Data Governance for Cloud Computing

July 18, 2011

I was recently reviewing the Data Integration architecture for a client and they asked me what they should be looking at for Data Integration when they start using Cloud Computing.  The simple, and boring, answer is that you should be able to use the same solutions in the Cloud as you do in a traditional server/database environment.  Your Enterprise Service Bus is specifically meant to be able to integrate across heterogeneous technologies.  Integrating data from a Cloud Computing environment with other data should only require adapters for the specific technologies of the servers where the data is located, whether for Hadoop, other specialized file systems, or specialized database management systems.

The answer for Data Governance of Cloud Computing is even more boring: there aren’t any changes.  Data Governance is about management and processes and is technology independent.  You may need some specialized tools for profiling data and reporting data quality metrics, but the Data Governance process itself doesn’t change between technologies.