Page 56 - Cyberculture and New Media

Francisco J. Ricardo                  47
                             ______________________________________________________________
                                            Appendix C – The Enron Mail Corpus

                                     In  emails,  inserting  extraneous  text  (e.g.,  news  stories  from  The
                             Associated Press, Reuters) is common, and these had to be removed so that
                             the  true  style  of  email  writing  could  be  examined.  The  manual  distillation
                             process included the elimination of all person references as well as titles (which are
                             not part of the body of a text). Incidentally, having controlled for spam or
                             automatically   generated   titles   (e.g.,   “Breaking   News   from
                             ABCNEWS.com”), “RE:”, “FWD:” and repeated entries, the average email
                             title  is  3.56  words  in  length.  500  random  messages  from  the  Enron  email
                             corpus were cleaned, scanned and parsed for style according to the criteria
                             indicated below.

                                 1.  Repeated or extratextual lines were eliminated (those beginning with
                                     “>”);
                                 2.  Reports included in emails were eliminated (e.g., “Energy Executive
                                     Daily”);
                                 3.  Words containing “@” were eliminated as potential email addresses;
                                 4.  Lines  containing  email  headers  (e.g.,  “From:”,  “To:”,  “cc:”,
                                     “Subject:”, etc.) were eliminated.
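
The mechanical criteria above (1, 3 and 4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the function name `clean_message` and the `HEADER_PREFIXES` tuple are hypothetical, header lines are assumed to begin with the header field, and criterion 2 (embedded reports) is left out because it depends on manual judgement.

```python
# Header fields named in criterion 4; this tuple is an illustrative assumption.
HEADER_PREFIXES = ("from:", "to:", "cc:", "subject:")


def clean_message(text):
    """Apply criteria 1, 3 and 4 to one message and return the cleaned text."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        # Criterion 1: drop repeated (quoted) lines beginning with ">".
        if stripped.startswith(">"):
            continue
        # Criterion 4: drop lines that begin with an email header field.
        if stripped.lower().startswith(HEADER_PREFIXES):
            continue
        # Criterion 3: drop any word containing "@" as a potential address.
        words = [w for w in stripped.split() if "@" not in w]
        kept.append(" ".join(words))
    return "\n".join(kept)
```

Matching headers only at the start of a line is a deliberately conservative choice here, so that ordinary sentences that happen to contain "to:" survive the filter.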

                             The original extraction comprised 99,241 words and 493,144 characters on 17,229
                             lines, the equivalent of 303 pages of text.
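
The average title length reported earlier in this appendix could be computed along the following lines. This is a hedged sketch, not the original method: the function name `mean_title_length` is hypothetical, "repeated entries" is assumed to mean duplicate titles once "RE:" and "FWD:" markers are stripped, and spam or automatically generated titles are assumed to have been removed beforehand.

```python
import re

# Leading reply/forward markers ("RE:", "FWD:"), possibly stacked; case-insensitive.
PREFIX = re.compile(r"^\s*((re|fwd)\s*:\s*)+", re.IGNORECASE)


def mean_title_length(titles):
    """Mean word count over distinct titles after stripping RE:/FWD: markers."""
    seen = set()
    lengths = []
    for title in titles:
        core = PREFIX.sub("", title).strip()
        key = core.lower()
        # Skip empty titles and repeated entries (assumed: same text, any case).
        if not core or key in seen:
            continue
        seen.add(key)
        lengths.append(len(core.split()))
    return sum(lengths) / len(lengths) if lengths else 0.0
```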

                                                          Notes

                             1
                               One  might suppose the case of outlining software as the clear exception.
                             This  class  of  software  exhibits,  after  all,  the  swift  and  ready  capacity  for
                             promoting, demoting and reordering items, from lines to entire paragraphs. It
                             would thus seem the ideal topic processor were it not that what is moved is
                             only arranged graphically, rather than semantically. The software executes no
                             rules for identifying, relating, or maintaining coherence among the topics in
                             the user’s text.
                             2
                                Tufte,  E.,  The  Cognitive  Style  of  PowerPoint,  Graphics  Press,  Cheshire,
                             Connecticut, 2003.
                             3
                                Byrne,  D.,  E.E.E.I  (Envisioning  Emotional  Epistemological  Information),
                             Steidl Publishing, Göttingen, Germany, 2003.
                             4
                                Janzen-Wilde,  L.,  ‘Oral  and  Literate  Characteristics  of  Facilitated
                             Communication’, Facilitated Communication Digest, 1993/2, 1993.
                             5
                                Ferris,  S.  P.,  ‘Writing  Electronically:  The  Effects  of  Computers  on
                             Traditional Writing’, Journal of Electronic Publishing, 8, 2002.