Page 375 - Programming Microcontrollers in C

360    Chapter 7  Advanced Topics

                              The purpose of this code is to calculate the frequency of
                          occurrence of each letter in a document and to indicate how well
                          the compression approach developed here works. The program was run
                          on a ten-page instruction manual and then on a telephone book
                          with 200 entries. The results of these two executions are shown below.

                              Char          Frequency Char               Frequency
                              >             30.5977        l             2.1504
                              e             9.1109         p             2.0276
                              t             6.5040         f             1.8257
                              o             5.2664         u             1.2727
                              r             5.1523         y             1.2288
                              i             4.6959         g             1.1762
                              a             4.6169         b             0.8514
                              s             4.5730         w             0.5793
                              n             4.2746         k             0.5617
                              h             3.0194         v             0.2984
                              c             2.8263         x             0.2721
                              d             2.6946         q             0.0351
                              <             2.2031         j             0.0088
                              m             2.1768         z             0.0000
                   There are 11393 characters
                   The theoretical average bits per character is
                   3.797302
                              Output 7-1: Calculation of entropy for the document manual.doc

                              The output shown above follows very closely the expected
                          occurrence of letters found in typical technical text. The bits per
                          character should be about 4.5, but this value is distorted because the
                          space character is included in the count: its very frequent
                          occurrence skews the overall averages and hence the entropy per
                          character found in the document.
                              Output 7-2 below repeats the same calculation on the contents
                          of a phone book. Note that the occurrences of the letters and other
                          characters are quite different from those found above. Even
                          though the phone book used to create the table below contained only
                          about 200 entries, these data will be used to create a Huffman code
                          to compress the names stored in the microcomputer EEPROM.
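As a rough illustration of how a table of occurrence counts leads to a Huffman code, the sketch below computes only the code length, in bits, that each symbol would receive. It is not the implementation developed in this chapter: the names huffman_lengths and MAX_SYMBOLS are assumptions, and a simple linear search for the two lightest nodes stands in for a priority queue, which is adequate for a small alphabet.

```c
#include <stddef.h>

#define MAX_SYMBOLS 64   /* assumed upper bound on alphabet size */

/* Compute Huffman code lengths for n symbols from their occurrence counts.
   lengths[i] receives the number of bits assigned to symbol i; symbols
   with a zero count receive length 0 (no code).
   Returns 0 on success, -1 if n exceeds MAX_SYMBOLS. */
int huffman_lengths(const unsigned long counts[], size_t n,
                    unsigned char lengths[])
{
    unsigned long weight[2 * MAX_SYMBOLS];  /* leaves first, then merged nodes */
    int parent[2 * MAX_SYMBOLS];            /* -1 means no parent yet */
    unsigned char active[2 * MAX_SYMBOLS];  /* 1 = still waiting to be merged */
    size_t nodes, remaining = 0, i;

    if (n > MAX_SYMBOLS)
        return -1;

    for (i = 0; i < n; i++) {
        weight[i] = counts[i];
        parent[i] = -1;
        active[i] = (unsigned char)(counts[i] != 0);
        remaining += active[i];
        lengths[i] = 0;
    }
    nodes = n;

    /* Repeatedly merge the two lightest active nodes into a new node. */
    while (remaining > 1) {
        size_t a = nodes, b = nodes;        /* 'nodes' marks "not found" */
        for (i = 0; i < nodes; i++) {
            if (!active[i])
                continue;
            if (a == nodes || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b == nodes || weight[i] < weight[b]) {
                b = i;
            }
        }
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        active[nodes] = 1;
        parent[a] = parent[b] = (int)nodes;
        active[a] = active[b] = 0;
        nodes++;
        remaining--;
    }

    /* The depth of each leaf in the tree is its code length. */
    for (i = 0; i < n; i++) {
        int p = parent[i];
        while (p != -1) {
            lengths[i]++;
            p = parent[p];
        }
        /* A lone symbol has depth 0 but still needs one bit to be sent. */
        if (counts[i] != 0 && lengths[i] == 0)
            lengths[i] = 1;
    }
    return 0;
}
```

With counts of 5, 2, 1 and 1, the resulting lengths are 1, 2, 3 and 3 bits, so the nine characters occupy 15 bits instead of the 18 needed by a fixed two-bit code.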