Page 171 - ARM 64 Bit Assembly Language

                  6.2 Word frequency counts

                  Counting the frequency of words in written text has several uses. In digital forensics, it can
                  be used to provide evidence as to the author of written communications. Different people
                  have different vocabularies, and use words with differing frequency. Word counts can
                  also be used to classify documents by type. Scientific articles from different fields contain
                  words specific to the field, and historical novels will differ from western novels in word
                  frequency.

                  Listing 6.4 shows the main function for a simple C program which reads a text file and
                  creates a list of all of the words contained in the file, along with their frequency of
                  occurrence. The program has been divided into two parts: the main program, and an ADT.
                  The ADT is used to keep track of the words and their frequencies, and to print a table of
                  word frequencies.
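
To make the role of the ADT concrete, the following is a minimal sketch of one way a word-frequency list could be organized: a singly linked list of (word, count) pairs. The names freq_list, freq_list_alloc, freq_list_add, freq_list_count, freq_list_print, and freq_list_free are assumptions made for illustration. The interface actually declared in list.h is not shown in this part of the text and may differ.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical word-frequency ADT (illustrative names only). */
typedef struct freq_node {
  char *word;               /* private copy of the word       */
  int count;                /* number of times it was added   */
  struct freq_node *next;
} freq_node;

typedef struct {
  freq_node *head;
} freq_list;

/* Create an empty list. Returns NULL if allocation fails. */
freq_list *freq_list_alloc(void)
{
  freq_list *fl = (freq_list*)malloc(sizeof(freq_list));
  if(fl != NULL)
    fl->head = NULL;
  return fl;
}

/* Record one occurrence of word. If the word is already in the
   list its count is incremented; otherwise a new node with a
   count of one is pushed on the front of the list.
   Returns 0 on success, -1 if allocation fails. */
int freq_list_add(freq_list *fl, const char *word)
{
  freq_node *cur;
  for(cur = fl->head; cur != NULL; cur = cur->next)
    if(strcmp(cur->word, word) == 0)
      {
        cur->count++;
        return 0;
      }
  cur = (freq_node*)malloc(sizeof(freq_node));
  if(cur == NULL)
    return -1;
  cur->word = (char*)malloc(strlen(word)+1);
  if(cur->word == NULL)
    {
      free(cur);
      return -1;
    }
  strcpy(cur->word, word);
  cur->count = 1;
  cur->next = fl->head;
  fl->head = cur;
  return 0;
}

/* Return the count recorded for word, or 0 if it is absent. */
int freq_list_count(const freq_list *fl, const char *word)
{
  const freq_node *cur;
  for(cur = fl->head; cur != NULL; cur = cur->next)
    if(strcmp(cur->word, word) == 0)
      return cur->count;
  return 0;
}

/* Print one "count word" line for each entry in the list. */
void freq_list_print(const freq_list *fl, FILE *out)
{
  const freq_node *cur;
  for(cur = fl->head; cur != NULL; cur = cur->next)
    fprintf(out, "%8d %s\n", cur->count, cur->word);
}

/* Release the list and every word it owns. */
void freq_list_free(freq_list *fl)
{
  freq_node *cur = fl->head;
  while(cur != NULL)
    {
      freq_node *next = cur->next;
      free(cur->word);
      free(cur);
      cur = next;
    }
  free(fl);
}
```

In this sketch, a main program would call freq_list_add once for each cleaned-up word and freq_list_print once at the end. The linear search makes each insertion proportional to the number of distinct words seen so far, which is acceptable for short texts but would likely be replaced by a tree or hash table for large ones.
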
                                  Listing 6.4 C program to compute word frequencies.

                  #include <stdlib.h>
                  #include <string.h>
                  #include <stdio.h>
                  #include <ctype.h>
                  #include <list.h>
                  /***********************************************************/
                  /* remove_punctuation copies the input string to a new     */
                  /* string, omitting punctuation characters and converting  */
                  /* the remaining characters to lower case.                 */
                  char *remove_punctuation(char *word)
                  {
                    char *newword = (char*)malloc(strlen(word)+1);
                    char *curdst = newword;
                    char *cursrc = word;
                    while( *cursrc != 0 )
                      {
                        if(strchr(",.\"!$();:{}\\[]", *cursrc) == NULL)
                          { /* Current character is not punctuation */
                            /* tolower requires a value representable as */
                            /* unsigned char, so cast before the call    */
                            *curdst = tolower((unsigned char)*cursrc);
                            curdst++;
                          }
                        cursrc++;
                      }
                    *curdst = 0;  /* terminate the new string */
                    return newword;
                  }

                  /***********************************************************/
                  /* The main function reads whitespace separated words      */
                  /* from stdin, removes punctuation, and generates a word   */
                  /* frequency list.                                         */
                  int main()