Page 174 - Data Architecture

Chapter 4.6: Textual Disambiguation





















               Fig. 4.6.9 Preprocessing text.


           E-mails—A Special Case




           E-mails are a special case of nonrepetitive unstructured data. They are special because
           nearly everybody has them and because there are so many of them. They also carry an
           enormous amount of system overhead that is useful to the e-mail system and to no one
           else. At the same time, e-mails carry a lot of valuable information about customers'
           attitudes and activities.


           It is possible to simply send e-mails into textual disambiguation. But such an exercise is
           fruitless because of the spam and blather found in e-mails. Spam is information
           generated outside the corporation that is not relevant to the business. Blather is
           internally generated correspondence that is not business related. For example, blather
           includes the jokes that are sent throughout the corporation.


           In order to use textual disambiguation effectively, the spam, blather, and system
           information need to be filtered out. Otherwise, the system becomes overwhelmed by
           meaningless information.


           Fig. 4.6.10 shows that there is a filter to remove unnecessary information from the stream
           of e-mails before the e-mails are processed by textual disambiguation.
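
           The filtering step can be sketched in a few lines of Python. This is only an
           illustration under stated assumptions, not the filter the figure depicts: the
           internal domain, the list of business terms, and the function names are all
           hypothetical, and a crude keyword test stands in for whatever relevance check a
           real filter would apply. Parsing the message drops the header block (the system
           overhead); a message with no business terms is labeled spam if it came from
           outside the corporation and blather if it came from inside.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical values for illustration; none of these come from the text.
INTERNAL_DOMAIN = "example.com"
BUSINESS_TERMS = {"order", "invoice", "refund", "contract", "delivery", "complaint"}


def is_business_relevant(body: str) -> bool:
    """Crude keyword test standing in for a real relevance classifier."""
    words = {w.strip(".,!?:;()\"'").lower() for w in body.split()}
    return bool(words & BUSINESS_TERMS)


def classify(raw_email: str) -> tuple[str, str]:
    """Return (label, body); label is 'keep', 'spam', or 'blather'."""
    msg = message_from_string(raw_email)
    # Parsing separates the header block (system overhead) from the body.
    body = msg.get_payload()
    if not isinstance(body, str):  # multipart messages are out of scope here
        body = ""
    body = body.strip()
    if is_business_relevant(body):
        return "keep", body
    # Not business related: the sender's origin decides spam versus blather.
    _, address = parseaddr(msg.get("From", ""))
    internal = address.lower().endswith("@" + INTERNAL_DOMAIN)
    return ("blather" if internal else "spam"), body


def filter_stream(raw_emails):
    """Yield only the bodies that should go on to textual disambiguation."""
    for raw in raw_emails:
        label, body = classify(raw)
        if label == "keep":
            yield body
```

           Only the bodies yielded by filter_stream would be passed on to textual
           disambiguation; the headers, spam, and blather never reach it.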













