Big Data Modeling – part 1 – Defining “Big Data” and “Data Modeling”

July 15, 2012

Last month I participated in a DataVersity webinar on Big Data Modeling. That discussion requires a number of definitions. What is meant by “Big Data”? What is meant by modeling? Does modeling mean entity-relationship modeling only, or something broader?

The term “Big Data” implies an emphasis on high volumes of data. What constitutes a big volume seems to depend on the organization and its history. The Wikipedia definition of “Big Data” says that an organization’s data is “big” when it can’t be comfortably handled by the technology solutions the organization has on hand. Since current relational database software can comfortably handle terabytes of data, and even desktop productivity software can comfortably handle gigabytes, “big” implies many terabytes at least.

However, the consensus definition of “Big Data” seems to be the Gartner Group’s, which says that “Big Data” implies large volume, variety, and velocity of data. The “variety” means not just data located in relational databases but files, documents, email, web traffic, audio, video, and social media as well, and not just data in an organization’s own data center but data in the cloud, data from external sources, and data on mobile devices.

The third aspect of “Big Data” is the velocity of data. The ubiquity of sensor and global-positioning data means a vast amount of information arriving at an ever-increasing rate from both internal and external sources. How quickly can this barrage of information be processed? How much of it needs to be retained, and for how long?

What is “data modeling”? Most people seem to picture this activity as synonymous with “entity-relationship modeling.” Is entity-relationship modeling useful for purposes beyond relational database design? If modeling is the process of creating a simpler representation of something that does or might exist, then we can use modeling to communicate information about something more simply than by presenting the thing itself. So modeling is used for communicating. Entity-relationship modeling communicates the attributes of the data and the types of relationships allowed between the pieces of data, which seems useful for communicating ideas outside of just relational databases.
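To make that concrete, here is a minimal sketch of what an entity-relationship model communicates, expressed in Python dataclasses rather than a diagram. The Customer and Order entities and their attributes are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical entities: the model names each entity, its attributes,
# and the relationships allowed between entities.
@dataclass
class Order:
    order_id: int
    order_date: str
    total_amount: float          # attributes of the Order entity

@dataclass
class Customer:
    customer_id: int
    name: str
    # One Customer may have many Orders (a one-to-many relationship);
    # an Order belongs to exactly one Customer.
    orders: List[Order] = field(default_factory=list)
```

The same information, entities, attributes, and allowed relationships, could just as easily describe a document store or an API payload, which is the sense in which the technique communicates beyond relational databases.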

Data modeling is also used to design data structures at various levels of abstraction from conceptual to physical. When we differentiate between modeling and design, we are mostly just differentiating between logical design and design closer to the physical implementation of a database. So data modeling is also useful for design.

In the next part of this blog I’ll get back to the question of “Big Data Modeling.”


Data Virtualization – Part 2 – Data Caching

June 10, 2012

Another type of Data Virtualization, less frequently discussed than its use for Business Intelligence (see my previous blog), has to do with keeping data in the computer’s memory, or as close to it as possible, in order to dramatically speed up both data access and update.

A simplistic way of thinking about relative retrieval times: if it takes some number of nanoseconds to retrieve something from memory, it will take something like 1,000 times that, or more, to retrieve it from disk. Depending on the infrastructure configuration, retrieving data over a LAN or from the internet may be ten to 1,000 times slower than that. If I load my most heavily used data into memory in advance, or into something that behaves like memory, then my processing of that data should be sped up by multiple orders of magnitude. Using solid-state disk for heavily used data can achieve access and update response times similar to having data in memory. Computer memory and solid-state drives, although not as cheap as traditional disk, are substantially less expensive than they used to be.
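As a back-of-the-envelope illustration of those ratios, here is a tiny sketch. The constants are rough assumptions that follow the ratios described above, not measurements; real figures vary widely by hardware and network.

```python
# Rough, illustrative latency assumptions (not measurements).
MEMORY_NS = 100                   # retrieve a value already in RAM
DISK_NS = MEMORY_NS * 1_000       # ~1,000x slower: a read from disk
NETWORK_NS = DISK_NS * 100        # 10x-1,000x slower again: a trip over a LAN/WAN

for name, ns in [("memory", MEMORY_NS), ("disk", DISK_NS), ("network", NETWORK_NS)]:
    print(f"{name:8s} ~{ns:>12,} ns   ({ns // MEMORY_NS:,}x memory)")
```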

Memory caching of data can be done with traditional databases and sequential processing, and it can yield response-time improvements of multiple orders of magnitude. However, really spectacular performance is possible if we combine memory caching with parallel computing and databases designed around data caching, such as GemFire. This does require developing systems with these new technologies and approaches in order to take advantage of parallel processing and optimized data caching, but the results can be blazingly fast.
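A minimal sketch of the caching idea, not GemFire itself: pre-load the hot data into memory once, then let parallel workers read from that cache instead of going back to disk for every request. The data and function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

hot_cache = {}   # in-memory store for the most heavily used data

def preload(records):
    """Load heavily used records into memory in advance."""
    for key, value in records:
        hot_cache[key] = value

def lookup(key):
    # Served from memory; a real system would fall back to the database on a miss.
    return hot_cache.get(key)

if __name__ == "__main__":
    preload((f"cust-{i}", {"balance": i * 10}) for i in range(1_000))
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lookup, (f"cust-{i}" for i in range(1_000))))
    print(f"{len(results)} lookups answered from the in-memory cache")
```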


People who Tweet about Data Management

April 30, 2012

Data Management & Architecture

Karen Lopez @datachick

Neil Raden @NeilRaden

Robin Bloor @robinbloor

M. David Allen @mdavidallen

Sue Geuens @suegeuens

Mehmet Orun @DataMinstrel

Alec Sharp @alecsharp

Loretta Mahon Smith @silverdata

Eva Smith @datadeva

Corine Jasonius @DataGenie

Peter Aiken @paiken

Tony Shaw @tonyshaw

Glenn Thomas @Warduke

Bonnie O’Neil @bonnieoneil

Rob Paller @RobPaller

Pete Rivett @rivettp

Charles T. Betz @CharlesTBetz

Tracie Larsen @RelatedStuff

Wayne Eckerson @weckerson

Julian Keith Loren @jkloren

Christophe @mydatanews

Steve Francia @spf13

Gorm Braavig @gormb

Jim Finwick @jimfinwick

Alexej Freund @alexej_freund

Corinna Martinez @Futureatti

Data Quality

Jim Harris @ocdqblog – blog

David Loshin @davidloshin – blog

Rich Murnane @murnane

Daragh O Brien @daraghobrien

Jacqueline Roberts @JackieMRoberts

Steve Tuck @SteveTuck

Vish Agashe @VishAgashe

Julian Schwarzenbach @jschwa1

Henrik L. Sorensen @hlsdk

MDM and Data Governance

Jill Dyche @jilldyche – blog

Charles Blyth @charlesblyth

Steve Sarsfield @stevesarsfield – blog

Dan Power @dan_power

Philip Tyler @tylep0

Business Intelligence and Analytics

Marcus Borba @marcusborba

Tamara Dull @tamaradull

Claudia Imhoff @Claudia_Imhoff – blog

Scott Wallask @BI_expert

Peter Thomas @PeterJThomas – blog

Barney Finucane @bfinucane

Matt Winkleman @mattwinkleman

Stray_Cat @Stray_Cat

Brett2point0 @Brett2point0

Risk Management

Peter Went @Bank_Risk

Joshua Corman @joshcorman

Michael Rasmussen @GRCPundit

Nenshad Bardoliwalla @nenshad

Gary Byrne @GRCexpert

Helmut Schindlwick @Schindwick

Technology Companies and Data Organizations

Oracle @Oracle

DAMA international @DAMA_I

McKinsey on BT @mck_biztech

SmartData Collective @SmartDataCo

DataFlux InSight @Datafluxinsight

Gartner @Gartner_inc

TDWI @TDWI

Scientific Computing @SciCom

Wearecloud @wearecloud

CloudCamp @cloudcamp

Panorama Software @PanoramaSW

Data Hole @datahole

BI Knowledge Base @biknowledgebase

EnterpriseArchitects @enterprisearchitects

DataQualityPro.com @dataqualitypro

RSA Archer eGRC @ArcherGRC

Exobox @Exobox_Security

EA_Consultant @EA_Consultant

Cloudbook @cloudbook

ID Experts @idexperts

IAIDQ @iaidq

EMC Forum @EMCForums

Data Junkies @datajunkies

True Finance Data @truefinancedata

Madam @TheMDMNetwork

IBM Initiate @IBMInitiate

Accelus_GRC @PaisleyGRC

DQ Asia Pacific @DQAsiaPacific

Data Guide @DataGuide

PCI PA-DSS Data @DataAssurant

DataFlux Corporation @DataFlux


Data Virtualization – Part 1 – Business Intelligence

March 26, 2012

The big transformation that we’re all dealing with in technology today is virtualization.  There are many aspects to virtualization: infrastructure, systems, organization, office, applications. When you search the internet for “data virtualization,” most of the references concern business intelligence and data warehousing uses.  In part 2 of this blog I will talk about data virtualization and transaction processing.

Back in the day, when I used to build data warehouses (the 1990s and on), there was a concept called the “federated data warehouse,” in which data in the logical warehouse would be physically separated, either with the same schema in multiple instances or with different types of data in different locations.  The idea was that the data would be physically separate but brought together in real time for reporting.  We also used to call that “data warehouses that don’t work.”  After all, the reason we created data warehouses in the first place was that we needed to instantiate the consolidated data in order to make response time reasonable when reporting on millions of records. No, really, the response time on these “federated data warehouse” systems used to be many minutes or more.

Now, however, the technologies involved have made huge leaps in capability.  Vendors have put thousands and thousands of hours into making real-time integration and reporting work.  Many techniques enable these capabilities: specialized software and hardware (data appliances), query optimization, distributed processing, other optimization techniques, and hybrid solutions that sit between pure virtualization and full instantiation.  Specialized tuning is still necessary, and the fastest solutions still involve instantiating the consolidated data in a central place.

Ultimately, having to run a project to physically incorporate new data into the data warehouse isn’t responsive enough to the business need for information.  Better to have a short-term solution that allows new data to be incorporated quickly and then, if there is a continued need for that data and you want faster response, integrate it into the physical data warehouse.

The problems being solved now with data virtualization for business intelligence include real-time integration of multiple regional instances of a data warehouse, integration of data of different types, and integration of data warehouse data with big data and cloud data.  This enables much more responsive business intelligence and analytical solutions to business requests, without always having to instantiate all data for analysis in a single, central enterprise data warehouse.
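As a toy sketch of the federation idea: two independent “regional” stores are queried in place and combined at the virtualization layer, rather than copied into one physical warehouse first. The table and column names are invented for illustration.

```python
import sqlite3

def regional_store(rows):
    """Stand-in for one regional data warehouse instance."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

east = regional_store([("east", 120.0), ("east", 80.0)])
west = regional_store([("west", 200.0), ("west", 50.0)])

# The "virtual" query: push the aggregation down to each source, then merge
# the partial results instead of physically consolidating the detail rows.
combined = []
for source in (east, west):
    combined.extend(source.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))

print(combined)   # [('east', 200.0), ('west', 250.0)]
```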



What is different about Big Data Governance?

December 21, 2011

In most ways, Data Governance of Big Data is not different from normal Data Governance.  The benefits are the same.  The reasons for doing it are the same.  And, mostly, what needs to be done is the same.

What is different about Big Data Governance is that it covers more data types, requires more sophisticated tools, and makes collecting more metadata critical.

First of all, Big Data Governance requires performing Governance over many different types of data, not just what’s in relational databases.  Certainly, the scope needs to include non-relational databases, unstructured data, and documents.  This in itself may require new tools to deal with these other technologies.

Secondly (and maybe this should be first, because it is about data volumes), more sophisticated tools are needed to assess and profile data.  Big Data volumes are beyond humanly manageable scale, and the traditional approach of profiling and managing data primarily through observation becomes infeasible.
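The kind of automated, per-column profiling meant here might look like the following minimal sketch; the sample data frame is hypothetical and simply stands in for whatever source is being profiled.

```python
import pandas as pd

# Tiny stand-in for a source table that is too large to inspect by eye.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 3, None],
    "email": ["a@x.com", None, "c@x.com", "c@x.com", "e@x.com"],
})

# Per-column statistics a profiling tool would compute automatically.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": df.isna().mean().round(2),
    "distinct": df.nunique(),
})
print(profile)
```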

Thirdly, collecting and documenting metadata becomes critical in order to automate as much of the Data Governance activity as possible.  This item is tied to the one above, in that more sophisticated tools can help infer the metadata describing the relationships between the data, and that metadata is required to automate the monitoring activities.
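A minimal sketch of what metadata-driven automation could look like: the rules live as metadata, and one generic checker applies them, so bringing a new data set under governance means adding metadata rather than writing new code. The field names and rules are invented for illustration.

```python
# Rules captured as metadata rather than hard-coded checks.
metadata = {
    "customer_id": {"required": True, "unique": True},
    "email":       {"required": True, "unique": False},
}

def monitor(records):
    """Apply whatever rules the metadata declares to a batch of records."""
    issues = []
    for column, rules in metadata.items():
        values = [record.get(column) for record in records]
        if rules["required"] and any(v is None for v in values):
            issues.append(f"{column}: missing values")
        if rules["unique"] and len(set(values)) != len(values):
            issues.append(f"{column}: duplicate values")
    return issues

print(monitor([
    {"customer_id": 1, "email": "a@x.com"},
    {"customer_id": 1, "email": None},
]))   # ['customer_id: duplicate values', 'email: missing values']
```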

In summary, the strategic reasons for doing Data Governance remain the same, as does the way the Data Governance organization is structured, but how the Data Governance of Big Data is actually performed may be very different.


Data Governance Certification or Data Stewardship Certification?

December 6, 2011

The Data Management Association (DAMA) is now offering a Data Governance certification as an option within its current Certified Data Management Professional program. This is a natural extension, since a Data Governance test already existed under the current certification process, and the new credential merely requires a specific configuration of test modules. But what does Data Governance certification mean, and is it really what is needed?

The Data Governance certification offered by DAMA is, to a great extent, based on the Data Governance practice area described in the DAMA Data Management Body of Knowledge (DMBOK), published in 2009. That practice area focuses on best practices for a Data Governance program and organization: what activities it should perform, what tools it should use, and what roles and responsibilities should be present. But do we need to certify that people know how to set up a Data Governance program? Or should we instead focus on what the people who actually perform Data Governance for an organization, the Data Stewards, should be doing?

Certifying Data Stewards may not be something that should be done generically. Rather, an organization may want to certify that the Data Stewards it has identified are knowledgeable in the agreed standard operating procedures for Data Stewards in that particular organization.

In summary, a general Data Governance certification that identifies individuals who are familiar with how a Data Governance organization should be created and operated makes sense. It makes more sense for an organization to certify its own Data Stewards on the particular processes unique to that organization.