Page 182 - Data Architecture

Chapter 4.7: Taxonomies
           is not a mistake.


           To explain this anomaly and why it is important, consider the following real example.


           In general, unstructured data can be divided into repetitive and nonrepetitive data.
           Repetitive unstructured data are unstructured data whose content and structure are highly
           repetitive. This class includes clickstream data, analog data, metering data, and so
           forth. The other class, nonrepetitive data, includes all data that are written: e-mails,
           call center data, customer feedback, contracts, and a whole host of other written and
           spoken narrative data.

           Now, consider that within the classification of narrative data, there is a further
           subclassification. Written data can be either nonrepetitive written data or repetitive
           written data. For example, lawyers who write contracts use what is called
           “boilerplate.” A boilerplate contract is a contract where the primary body of the contract
           is predetermined. The lawyer fills in only a few details, such as the name, address, and
           social security number of the recipient of the contract. There may be a few other terms
           that are negotiated, but at the end of the day, boilerplate contracts are very, very
           similar.


           This then is an example of a repetitive nonrepetitive occurrence of data. The contract is
           nonrepetitive because it is in narrative form. But it is repetitive because it is essentially
           boilerplate.
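
The idea that boilerplate contracts are "very, very similar" can be illustrated with a rough similarity check. The following is only a minimal sketch: the template text, the names, the sample e-mail, and the 0.7 threshold are all hypothetical, and word-level comparison with Python's standard difflib module is just one assumed way to measure repetition, not a method described in this chapter.

```python
from difflib import SequenceMatcher

# Hypothetical boilerplate template; the wording and fields are illustrative only.
TEMPLATE = (
    "This agreement is made between Acme Lending and {name}, residing at "
    "{address}. The borrower agrees to repay the principal in monthly "
    "installments under the standard terms set out below."
)


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1 comparing two texts word by word."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def looks_like_boilerplate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Treat two documents as repetitive if their similarity meets the threshold.

    The threshold is an assumption chosen for this sketch.
    """
    return similarity(a, b) >= threshold


# Two contracts produced from the same template: only a few details differ.
contract_a = TEMPLATE.format(name="Jane Doe", address="12 Elm Street, Austin")
contract_b = TEMPLATE.format(name="John Roe", address="9 Oak Avenue, Denver")

# A free-form e-mail: narrative text with no shared template.
email = (
    "Hi team, the customer called again about the late delivery "
    "and wants a refund."
)

print(looks_like_boilerplate(contract_a, contract_b))
print(looks_like_boilerplate(contract_a, email))
```

In this sketch the two contracts share almost all of their words in the same order, so they score far higher against each other than either does against the free-form e-mail.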


           The reason for making the distinction between nonrepetitive nonrepetitive text and
           nonrepetitive repetitive text is that taxonomies apply to nonrepetitive nonrepetitive text.
           Some examples are needed here to explain this anomaly.



           Applicability of Taxonomies



           Taxonomies are most applicable to text such as e-mails, call center information,
           conversations, and other free-form narrative text. In free-form text, it is necessary to
           classify words using only the context supplied by the taxonomy. As an example, suppose
           the term “ice cream” is encountered in an e-mail. Ice cream belongs in the taxonomy of
           “dessert.” It is assumed that the e-mail is about food and meals and desserts. Another
           e-mail mentions cake. Cake too is a dessert. So, the e-mails are related to each other,
           even though the words “ice cream” and “cake” are very different. Using taxonomic
           classification in
