Page 42 - Data Architecture

Chapter 1.3: The “Great Divide”

               Fig. 1.3.1 The great divide.


           Repetitive unstructured data are data that occur very often and whose records are almost
           identical in terms of structure and content. There are many examples of repetitive
           unstructured data—telephone call records, metered data, analog data, and so forth.


           Nonrepetitive unstructured data are data that consist of records of data where the records
           are not similar, in terms of either structure or content. There are many examples of
           nonrepetitive unstructured data—e-mails, call center conversations, warranty claims, and
           so forth.
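
           The contrast between the two definitions above can be illustrated with a small
           sketch. It is not a method from this book: the helper names shape and
           dominant_shape_share, the masking rule, and all sample records are hypothetical,
           chosen only to show that repetitive records collapse to one structural pattern
           while nonrepetitive records do not.

```python
import re
from collections import Counter


def shape(record: str) -> str:
    """Reduce a record to a coarse structural mask (illustrative assumption).

    Runs of digits become "9" and runs of letters become "A"; punctuation
    and spacing are kept, so records with the same layout share one mask.
    """
    mask = re.sub(r"\d+", "9", record)
    mask = re.sub(r"[A-Za-z]+", "A", mask)
    return mask


def dominant_shape_share(records: list[str]) -> float:
    """Return the fraction of records that share the most common mask."""
    counts = Counter(shape(r) for r in records)
    return counts.most_common(1)[0][1] / len(records)


# Hypothetical repetitive records: telephone call records with one layout.
call_records = [
    "2015-03-01 09:14:02,555-0101,555-0199,124",
    "2015-03-01 09:15:40,555-0142,555-0107,37",
    "2015-03-01 09:17:11,555-0163,555-0120,802",
]

# Hypothetical nonrepetitive records: free text with no shared layout.
text_records = [
    "Hi Tom, can we move the meeting to 3 pm?",
    "Warranty claim: the left hinge cracked after two weeks.",
    "Thanks!",
]

if __name__ == "__main__":
    print("call records:", dominant_shape_share(call_records))
    print("text records:", dominant_shape_share(text_records))
```

           With these made-up samples, every call record reduces to the same mask, whereas
           each text record produces a different one. A real classification would need far
           more than a character mask; the sketch only makes the definitions concrete.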



           The “Great Divide”



           Between the two types of unstructured data is what can be termed the “great divide.”


           The “great divide” is the demarcation between repetitive and nonrepetitive records, as
           seen in the figure. At first glance, there would not appear to be a major difference
           between repetitive unstructured records and nonrepetitive unstructured records.
           But that is not the case at all. The difference between the two is substantial.


           The primary distinction between the two types of unstructured data is one of focus.
           Work with repetitive unstructured data centers on the management of data in the
           Hadoop/big data environment, whereas work with nonrepetitive unstructured data
           centers on textual disambiguation. And as shall be seen, this difference in focus
           makes a huge difference in how the data are perceived, how the data are used, and how