Continuing my discussion of “Big Data Modeling”: what is it, and is it any different from normal data modeling? Ultimately, the questions come down to these: is there a role for a modeler on Big Data projects, and what does that role look like?
Modeling for Communication –
If modeling is the process of creating a simpler representation of something that does or might exist, we can use modeling to communicate information about something more simply than by presenting the thing itself. After all, in describing a computer system we aren’t limited to presenting the system itself; we present various models to communicate different aspects of what is or what might be.
Modeling Semantics –
On Big Data projects, as with all data-oriented projects, it is necessary to communicate logical and semantic concepts about the data involved in the project. This may involve, but is not limited to, models presented in entity-relationship diagrams. The data modeling needs, in fact, are not limited to the design of structures but certainly include data flows, process models, and other kinds of models. This would also include any necessary taxonomy and ontology models.
Modeling Design –
Prior to construction it is necessary to represent (design) the data structures needed for the persistent as well as the transitory data used in the project. Persistent data structures include those in files or databases. Transitory data structures include the messages and streams of data passing into and out of the organization as well as between applications. For data received from other organizations or groups, this may mean receiving and documenting the structures rather than designing them. This is, or is close to, the physical design level of the implementation, including the design of database tables and structures, file layouts, metadata tags, message layouts, data services, etc.
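As a rough illustration of the two kinds of structures, here is a minimal Python sketch that designs one transitory structure (a message) and one persistent structure (a table) for the same data. The `OrderMessage` layout, the `orders` table, and the `store` helper are all hypothetical names invented for this example, and an in-memory SQLite database stands in for whatever store a real project would use.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


# Transitory structure: the layout of a message passed between applications.
# (Hypothetical layout, for illustration only.)
@dataclass
class OrderMessage:
    order_id: str
    customer_id: str
    amount_cents: int


# Persistent structure: a table design holding the same data.
# (Hypothetical table, for illustration only.)
ORDERS_DDL = """
CREATE TABLE orders (
    order_id     TEXT PRIMARY KEY,
    customer_id  TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
)
"""


def store(conn: sqlite3.Connection, msg: OrderMessage) -> None:
    """Persist one received message into the orders table."""
    conn.execute(
        "INSERT INTO orders (order_id, customer_id, amount_cents) VALUES (?, ?, ?)",
        (msg.order_id, msg.customer_id, msg.amount_cents),
    )


conn = sqlite3.connect(":memory:")
conn.execute(ORDERS_DDL)

# Serialize the message as it would travel between applications, then receive it.
wire = json.dumps(asdict(OrderMessage("o-1", "c-77", 2599)))
received = OrderMessage(**json.loads(wire))

store(conn, received)
stored = conn.execute(
    "SELECT order_id, customer_id, amount_cents FROM orders"
).fetchall()
```

Both designs describe the same business data, but each is a separate design artifact: the message layout can change with the interface, and the table can change with the storage technology.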
Modeling Virtual Layers –
There is a big movement in systems development toward virtualizing layers of the infrastructure, where the view presented to programmers or users may be different from the actual physical implementation. This move toward creating virtual layers that can change independently is true in data design as well. It is necessary to design, or model, the presentation of information to the system’s users (client experience) and programmers independently of the modeling of the physical data structures. This is even more necessary for Big Data because it includes designing levels of virtualization for normalizing or merging data of different types into a consistent format. In addition to the modeling of the virtual data layers, there is a need to design the translation from the physical data structures to the virtual level, such as between relational database structures and web service objects.
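To make the idea of a virtual layer concrete, here is a small Python sketch in which two differently shaped physical sources are translated into one consistent object. Everything in it is an assumption made for illustration: the `CustomerView` class, the column names such as `CUST_ID`, and the nested document shape are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class CustomerView:
    """Virtual-layer object presented to programmers and users (hypothetical)."""
    customer_id: str
    full_name: str
    email: str


def from_relational_row(row: Dict[str, Any]) -> CustomerView:
    # Physical shape A: a relational row with the name split across columns.
    return CustomerView(
        customer_id=str(row["CUST_ID"]),
        full_name=f'{row["FIRST_NM"]} {row["LAST_NM"]}'.strip(),
        email=row["EMAIL_ADDR"].lower(),
    )


def from_json_document(doc: Dict[str, Any]) -> CustomerView:
    # Physical shape B: a nested document, such as one from a feed or document store.
    return CustomerView(
        customer_id=str(doc["id"]),
        full_name=doc["profile"]["name"].strip(),
        email=doc["contact"]["email"].lower(),
    )


row = {
    "CUST_ID": 1001,
    "FIRST_NM": "Ada",
    "LAST_NM": "Lovelace",
    "EMAIL_ADDR": "Ada@Example.com",
}
doc = {
    "id": "c-77",
    "profile": {"name": "Grace Hopper"},
    "contact": {"email": "grace@example.com"},
}

# Consumers see one consistent format, whatever the physical source was.
customers: List[CustomerView] = [from_relational_row(row), from_json_document(doc)]
```

The two translation functions are the part that has to be designed alongside the virtual model: if a physical structure changes, only its translator changes, and `CustomerView` stays the same.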
Modeling Mappings and Transformations –
It is necessary in any design that involves the movement of data between systems, whether Big Data or not, to specify the lineage in the flow of data from physical data structure to physical data structure, including the mappings and transformation rules necessary from persistent data structure to message to persistent data structure, as necessary. This level of design requires an understanding of both the physical implementation and the business meaning of the data. We don’t usually call this activity modeling but strictly design.
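One way to picture such a specification is as a table of field mappings that can be both executed and read back as lineage. The Python sketch below assumes a hypothetical `FieldMapping` structure and made-up source and target field names; it is an illustration of the idea, not a description of any particular mapping tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldMapping:
    """One source-to-target mapping with its transformation rule (hypothetical)."""
    source: str                       # field in the source structure
    target: str                       # field in the target structure
    rule: str                         # the rule stated in business terms
    transform: Callable[[Any], Any]   # the rule as executable logic


MAPPINGS: List[FieldMapping] = [
    FieldMapping("CUST_ID", "customer_id", "cast to string", str),
    FieldMapping("EMAIL_ADDR", "email", "trim and lowercase",
                 lambda v: v.strip().lower()),
    FieldMapping("BAL_AMT", "balance_cents", "dollars to integer cents",
                 lambda v: round(float(v) * 100)),
]


def apply_mappings(record: Dict[str, Any], mappings: List[FieldMapping]) -> Dict[str, Any]:
    """Build the target record by applying each mapping to the source record."""
    return {m.target: m.transform(record[m.source]) for m in mappings}


def lineage(mappings: List[FieldMapping]) -> List[str]:
    """Describe where each target field comes from and how it was derived."""
    return [f"{m.source} -> {m.target} ({m.rule})" for m in mappings]


source_record = {"CUST_ID": 1001, "EMAIL_ADDR": " Ada@Example.com ", "BAL_AMT": "12.50"}
target_record = apply_mappings(source_record, MAPPINGS)
lineage_report = lineage(MAPPINGS)
```

Keeping the rule description next to the executable transform reflects the point above: each mapping needs both the physical detail (field names and types) and the business meaning (what the rule is for).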
Ultimately, there is a lot of work for a data modeler on Big Data projects, although little of it may look like creating entity-relationship models. There is a need to create models for communicating ideas, for designing physical implementation solutions, for designing levels of virtualization, and for mapping between these models and designs.