Page 42 - Data Architecture

Chapter 1.3: The “Great Divide”

Fig. 1.3.1 The great divide.

Repetitive unstructured data are data that occur very often and whose records are almost
identical in terms of structure and content. There are many examples of repetitive
unstructured data—telephone call records, metered data, analog data, and so forth.

Nonrepetitive unstructured data consist of records that are not similar in either
structure or content. There are many examples of nonrepetitive unstructured data:
e-mails, call center conversations, warranty claims, and so forth.
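
The distinction between the two kinds of records can be illustrated with a minimal Python sketch. It is only an illustration, and it rests on assumptions that do not come from the text: records are represented as dictionaries, the function name is_repetitive is hypothetical, the sample field names and values are invented, and only similarity of structure is checked, not similarity of content.

```python
def is_repetitive(records):
    """Return True when every record has the same set of field names.

    Assumption for illustration: a record is a dict, and "same structure"
    means "same field names". An empty batch is reported as not repetitive.
    """
    shapes = {frozenset(record.keys()) for record in records}
    return len(shapes) == 1


# Invented sample data: call records share one layout from record to record.
call_records = [
    {"caller": "555-0101", "callee": "555-0199", "seconds": 62},
    {"caller": "555-0144", "callee": "555-0123", "seconds": 305},
]

# Invented sample data: an e-mail and a call center note have different layouts.
messages = [
    {"sender": "a@example.com", "subject": "Invoice", "body": "Please see the attachment."},
    {"agent": "Dana", "transcript": "Customer reports a cracked screen."},
]

print(is_repetitive(call_records))
print(is_repetitive(messages))
```

In this sketch the call records count as repetitive because every record carries the same fields, while the mixed batch of messages does not.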

The “Great Divide”

Between the two types of unstructured data is what can be termed the “great divide.”

The “great divide” is the demarcation between repetitive and nonrepetitive records, as
seen in the figure. At first glance, it does not appear that there should be a massive
difference between repetitive unstructured records and nonrepetitive unstructured records.
But such is not the case at all. There is indeed a huge difference between repetitive
unstructured data and nonrepetitive unstructured data.

The primary distinction between the two types of unstructured data is that work with
repetitive unstructured data focuses on the management of data in the Hadoop/big data
environment, whereas work with nonrepetitive unstructured data focuses on textual
disambiguation of data. And as shall be seen, this difference in focus makes a huge
difference in how the data are perceived, how the data are used, and how