Page 162 - Data Architecture

Chapter 4.5: Contextualizing Repetitive Unstructured Data



           Abstract



           There are different definitions of big data. The definition used here is that big data
           encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman
           census” method, and stores data in an unstructured format. There are two major types of
           big data: repetitive big data and nonrepetitive big data. Only a small fraction of
           repetitive big data has business value, whereas almost all of nonrepetitive big data has
           business value. In order to achieve business value, the context of data in big data must be
           determined. Contextualization of repetitive big data is easily achieved. But
           contextualization of nonrepetitive data is done by means of textual disambiguation.


           Keywords



           Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive Data;
           Contextualization; Textual disambiguation


           To be used for analysis, all unstructured data must be contextualized. This is as
           true for repetitive unstructured data as it is for nonrepetitive unstructured data. But
           the two cases differ greatly: contextualizing repetitive unstructured data is easy and
           straightforward, whereas contextualizing nonrepetitive unstructured data is anything
           but easy.



           Parsing Repetitive Unstructured Data



           In the case of repetitive unstructured data, the data are read, usually in Hadoop. After the
           block of data is read, the data are then parsed. Given the repetitive nature of the data,
           parsing the data is straightforward. The record is small, and the context of the record is
           easy to find.


           Parsing and contextualizing the data found in big data can be done either with a
           commercial utility or with a custom-written program.
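
           As a rough illustration of what a custom-written parser might look like, the
           sketch below reads a block of repetitive records and attaches context (field
           names and types) to each one. The pipe-delimited layout, the field names, and
           the CallRecord type are all hypothetical, chosen only for this example; they
           do not come from the text or from any particular utility.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator

# Hypothetical record layout (an assumption for illustration only):
#   timestamp|caller|callee|duration_seconds
FIELD_COUNT = 4


@dataclass
class CallRecord:
    """One repetitive record with its context made explicit."""
    timestamp: datetime
    caller: str
    callee: str
    duration_seconds: int


def parse_record(line: str) -> CallRecord:
    """Split one record into fields and give each field a name and a type."""
    parts = line.strip().split("|")
    if len(parts) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    timestamp, caller, callee, duration = parts
    return CallRecord(
        timestamp=datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S"),
        caller=caller,
        callee=callee,
        duration_seconds=int(duration),
    )


def parse_block(block: str) -> Iterator[CallRecord]:
    """Parse every non-empty line of a block that has already been read."""
    for line in block.splitlines():
        if line.strip():
            yield parse_record(line)


if __name__ == "__main__":
    sample_block = (
        "2015-01-05T09:30:00|555-0100|555-0199|42\n"
        "2015-01-05T09:31:10|555-0101|555-0100|7\n"
    )
    for record in parse_block(sample_block):
        print(record)
```

           Because every record shares the same small layout, the context of each field
           can be fixed once in the parser and then applied to every record in the block.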

