Page 149 - Data Architecture

Chapter 4.4: Unstructured Data



           Abstract



           There are different definitions of big data. The definition used here is that big data
           encompasses very large volumes of data, is based on inexpensive storage, manages data by
           the “Roman census” method, and stores data in an unstructured format. There are two major
           types of big data: repetitive big data and nonrepetitive big data. Only a small fraction of
           repetitive big data has business value, whereas almost all nonrepetitive big data has
           business value. To achieve business value, the context of the data in big data must be
           determined. Contextualization of repetitive big data is easily achieved, but
           contextualization of nonrepetitive big data is done by means of textual disambiguation.


           Keywords



           Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data;
           Contextualization; Textual disambiguation


           It is estimated that over 80% of the data in the corporation is unstructured information.
           Unstructured information takes many different forms: video, audio, and images, among
           others. But far and away the most interesting and useful form of unstructured data is
           textual information.



           Textual Information—Everywhere



           Textual information is found everywhere in the corporation. Text is found in contracts, in
           e-mail, in reports, in memoranda, in human resource evaluations, and so forth. In short,
           textual information makes up the fabric of corporate life, and that is true for every
           corporation.


           Unstructured information can be broken into two major categories—repetitive
           unstructured data and nonrepetitive unstructured data. Fig. 4.4.1 shows the categories
           that describe all corporate data.



