Page 162 - Data Architecture

Chapter 4.5: Contextualizing Repetitive Unstructured Data

Abstract

There are different definitions of big data. The definition used here is that big data
encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman
census” method, and stores data in an unstructured format. There are two major types of
big data: repetitive big data and nonrepetitive big data. Only a small fraction of
repetitive big data has business value, whereas almost all of nonrepetitive big data has
business value. In order to achieve business value, the context of data in big data must be
determined. Contextualization of repetitive big data is easily achieved, whereas
contextualization of nonrepetitive data is done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data;
Contextualization; Textual disambiguation

In order to be used for analysis, all unstructured data need to be contextualized. This is as
true for repetitive unstructured data as it is for nonrepetitive unstructured data. But the
two differ greatly in difficulty: contextualizing repetitive unstructured data is easy and
straightforward, whereas contextualizing nonrepetitive unstructured data is anything but
easy.

Parsing Repetitive Unstructured Data

In the case of repetitive unstructured data, the data are read, usually in Hadoop. After the
block of data is read, the data are then parsed. Given the repetitive nature of the data,
parsing the data is straightforward. The record is small, and the context of the record is
easy to find.

The process of parsing and contextualizing the data found in big data can be done with a
commercial utility or with a custom-written program.
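
As an illustration of what a custom-written program might look like, the following is a
minimal sketch in Python. It assumes a hypothetical pipe-delimited record layout; the
field names in FIELDS, the delimiter, and the sample records are all invented for this
example and do not come from any particular system. Reading the block from Hadoop is
not shown; the block is represented here as a plain list of lines.

```python
from typing import Iterable, Iterator

# Hypothetical record layout: these field names are assumed for illustration only.
FIELDS = ("timestamp", "caller", "callee", "duration_seconds")


def contextualize(lines: Iterable[str], delimiter: str = "|") -> Iterator[dict]:
    """Yield one dict per well-formed record, pairing each value with its field name."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the block
        values = line.split(delimiter)
        if len(values) != len(FIELDS):
            continue  # skip records that do not match the expected layout
        record = dict(zip(FIELDS, values))
        if not record["duration_seconds"].isdigit():
            continue  # skip records whose duration is not a whole number
        record["duration_seconds"] = int(record["duration_seconds"])
        yield record


# A small block of repetitive records (sample data invented for this sketch).
block = [
    "2015-03-01T10:15:00|555-0100|555-0199|62",
    "2015-03-01T10:16:30|555-0101|555-0142|305",
]
records = list(contextualize(block))
```

Because every record shares the same small layout, the context of each value is simply
the name of the field it occupies, which is why a few lines of parsing are enough here.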
