Last month I participated in a DataVersity webinar on Big Data Modeling. That discussion depends on several definitions. What is meant by Big Data? What is meant by modeling? Does modeling mean entity-relationship modeling only or something broader?
The term “Big Data” implies an emphasis on high volumes of data. What constitutes big volumes seems to depend on the organization and its history. The Wikipedia definition of “Big Data” says that an organization’s data is “big” when it can’t be comfortably handled by on-hand technology solutions. Since the current set of relational database software can comfortably handle terabytes of data and even desktop productivity software can comfortably handle gigabytes of data, “big” implies many terabytes at least.
However, the consensus seems to favor the Gartner Group definition, which says that “Big Data” implies large volume, variety, and velocity of data. Therefore, “Big Data” means not just data located in relational databases but also files, documents, email, web traffic, audio, video, and social media. These various types of data provide the “variety”, and the data resides not just in an organization’s own data center but also in the cloud, in external sources, and on mobile devices.
The third aspect of “Big Data” is the velocity of data. The ubiquity of sensor and global position monitoring information means that a vast amount of information is available at an ever-increasing rate from both internal and external sources. How quickly can this barrage of information be processed? How much of it needs to be retained and for how long?
What is “data modeling”? Most people seem to picture this activity as synonymous with “entity relationship modeling”. Is entity relationship modeling useful for purposes outside of relational database design? If modeling is the process of creating a simpler representation of something that does or might exist, we can use modeling for communicating information about something in a simpler way than presenting the thing itself. So modeling is used for communicating. Entity relationship modeling is useful to communicate information about the attributes of the data and the types of relationships allowed between the pieces of data. This seems like it might be useful to communicate ideas outside of just relational databases.
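As a rough illustration of that last point, here is a minimal sketch in Python that records entities, attributes, and relationships with no relational database involved. The Sensor and Reading entities, their attributes, and the describe helper are all hypothetical names I made up for the example; they are not taken from the webinar or from any modeling tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str
    data_type: str


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple  # tuple of Attribute


@dataclass(frozen=True)
class Relationship:
    parent: str
    child: str
    cardinality: str  # for example "one-to-many"


def describe(entities, relationships):
    """Render the model as plain sentences that a reader can follow."""
    lines = []
    for entity in entities:
        attrs = ", ".join(f"{a.name} ({a.data_type})" for a in entity.attributes)
        lines.append(f"{entity.name} has attributes: {attrs}")
    for rel in relationships:
        lines.append(f"{rel.parent} relates to {rel.child} as {rel.cardinality}")
    return lines


# Hypothetical example: a sensor and the readings it emits.
sensor = Entity("Sensor", (Attribute("sensor_id", "string"),
                           Attribute("location", "string")))
reading = Entity("Reading", (Attribute("taken_at", "timestamp"),
                             Attribute("value", "number")))
emits = Relationship("Sensor", "Reading", "one-to-many")

model_summary = describe([sensor, reading], [emits])
for line in model_summary:
    print(line)
```

Nothing here assumes tables or SQL. The same description could apply to readings stored in files or documents, which is the sense in which the model communicates rather than implements.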
Data modeling is also used to design data structures at various levels of abstraction from conceptual to physical. When we differentiate between modeling and design, we are mostly just differentiating between logical design and design closer to the physical implementation of a database. So data modeling is also useful for design.
In the next part of this blog I’ll get back to the question of “Big Data Modeling.”