Page 375 - Programming Microcontrollers in C
360 Chapter 7 Advanced Topics
The purpose of this code is to calculate the frequency of occurrence
of letters in a document and to give some indication of how well
the compression approach being developed will work. The program was run
first on a ten-page instruction manual and then on a telephone book
with 200 entries. The results of the two runs are shown below.
Char   Frequency (%)     Char   Frequency (%)
 >       30.5977          l        2.1504
 e        9.1109          p        2.0276
 t        6.5040          f        1.8257
 o        5.2664          u        1.2727
 r        5.1523          y        1.2288
 i        4.6959          g        1.1762
 a        4.6169          b        0.8514
 s        4.5730          w        0.5793
 n        4.2746          k        0.5617
 h        3.0194          v        0.2984
 c        2.8263          x        0.2721
 d        2.6946          q        0.0351
 <        2.2031          j        0.0088
 m        2.1768          z        0.0000
There are 11393 characters
The theoretical average bits per character is
3.797302
Output 7-1: Calculation of entropy for the document manual.doc
The letter frequencies shown above closely follow those expected
in typical technical text. The bits per character should be about 4.5,
but the value found here is lower because the space character is
included in the count. It occurs so often that it skews the overall
distribution and hence the entropy per character found for the
document.
Shown below in Output 7-2 is a repeat of the same calculation on
the contents of a phone book. Note that the frequencies of the letters
and other characters are quite different from those found above. Even
though the phone book used to create the table below contained only
about 200 entries, these data will be used to create a Huffman code
to compress the names stored in the microcomputer EEPROM.
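As a rough illustration of how a frequency table of this kind can be turned into a Huffman code, the sketch below computes only the code length assigned to each symbol, using the standard textbook procedure of repeatedly merging the two lightest nodes. It is not the implementation developed in this chapter. The names huffman_lengths, average_bits, and MAXSYM are hypothetical, and the limit of 64 symbols is an assumption chosen only to keep the working arrays small.

```c
#define MAXSYM 64  /* assumption: upper bound on the alphabet size */

/* Compute Huffman code lengths for n symbols with the given weights
   (for example, character counts). Returns 0 on success, -1 if n is
   out of range. Symbols with zero weight still receive a code. */
int huffman_lengths(const unsigned long weight[], int n,
                    unsigned char length[])
{
    unsigned long w[2 * MAXSYM];  /* leaves first, then merged nodes */
    int parent[2 * MAXSYM];       /* -1 until the node is merged */
    int alive[2 * MAXSYM];        /* 1 while the node awaits merging */
    int nodes, remaining, i, a, b;

    if (n < 1 || n > MAXSYM)
        return -1;

    for (i = 0; i < n; i++) {
        w[i] = weight[i];
        parent[i] = -1;
        alive[i] = 1;
    }
    nodes = n;
    remaining = n;

    while (remaining > 1) {
        /* Find the lightest (a) and second lightest (b) live nodes. */
        a = b = -1;
        for (i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || w[i] < w[a]) {
                b = a;
                a = i;
            } else if (b < 0 || w[i] < w[b]) {
                b = i;
            }
        }
        /* Merge them under a new internal node. */
        w[nodes] = w[a] + w[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        parent[a] = parent[b] = nodes;
        alive[a] = alive[b] = 0;
        nodes++;
        remaining--;
    }

    /* A symbol's code length is its depth below the root. */
    for (i = 0; i < n; i++) {
        int depth = 0;
        int p = i;
        while (parent[p] >= 0) {
            p = parent[p];
            depth++;
        }
        length[i] = (unsigned char)depth;
    }
    if (n == 1)
        length[0] = 1;  /* a lone symbol still needs one bit */
    return 0;
}

/* Weighted average code length in bits per symbol. */
double average_bits(const unsigned long weight[],
                    const unsigned char length[], int n)
{
    unsigned long total = 0, bits = 0;
    int i;

    for (i = 0; i < n; i++) {
        total += weight[i];
        bits += weight[i] * length[i];
    }
    return total ? (double)bits / (double)total : 0.0;
}
```

The value returned by average_bits can be compared with the theoretical bits per character computed for the same data; a Huffman code cannot average fewer bits per symbol than that entropy figure, and is always within one bit of it.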