Page 130 - Computational Statistics Handbook with MATLAB

Chapter 5: Exploratory Data Analysis                            117


                              If we have data where each observation consists of at least two digits, then
                             we can construct a stem-and-leaf diagram. To display the data, we separate each
                             measurement into two parts: the stem and the leaf. The stem consists of the
                             leading digit or digits, and the remaining digit makes up the leaf. For
                             example, for the number 75, the stem is 7 and the leaf is 5.
                             If the number is 203, then the stem is 20 and the leaf is 3.
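
                             As a minimal sketch of this split (not the function supplied with the
                             text), integer division and the remainder on division by 10 recover the
                             stem and the leaf. The variable names are our own, and the sketch assumes
                             non-negative integers.

```matlab
% Split each observation into a stem (leading digits) and a leaf (last digit).
% Assumes non-negative integer data; variable names are illustrative only.
x = [75 203];
stem = floor(x / 10);   % leading digit or digits: [7 20]
leaf = mod(x, 10);      % remaining digit:         [5 3]
disp([stem; leaf])      % first row holds the stems, second row the leaves
```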
                              The stems are listed to the left of a vertical line with all of the leaves
                             corresponding to that stem listed to the right. If the data contain decimal
                             places, then they can be rounded for easier display. An alternative is to move
                             the decimal place to specify the appropriate leaf unit. We provide a function
                             with the text that will construct stem-and-leaf plots, and its use is
                             illustrated in the next example.
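
                             Independently of the function supplied with the text, the layout just
                             described can be sketched in a few lines. The data values, the leaf unit
                             of 0.1, and all variable names below are hypothetical, chosen only to
                             show how moving the decimal place turns decimal data into stems and
                             leaves. The sketch assumes non-negative data.

```matlab
% Text stem-and-leaf display for data with one decimal place.
% The data and leafUnit are made up for illustration; they are not from the text.
x = [1.2 1.5 2.1 2.3 2.8 4.0];
leafUnit = 0.1;                    % assumed leaf unit: each leaf counts tenths
v = sort(round(x / leafUnit));     % move the decimal place: 12 15 21 23 28 40
stems = floor(v / 10);
leaves = mod(v, 10);
rows = {};
for s = min(stems):max(stems)
    % Concatenate the leaves belonging to this stem into a string of digits.
    leafStr = sprintf('%d', leaves(stems == s));
    % Stem to the left of the vertical bar, leaves to the right.
    rows{end+1} = [sprintf('%3d | ', s), leafStr];
end
fprintf('%s\n', rows{:});
```

                             Stems with no observations are still printed with an empty leaf list, so
                             that gaps in the data remain visible in the display.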


                             Example 5.3
                             The heights of 32 Tibetan skulls [Hand, et al., 1994; Morant, 1923], measured
                             in millimeters, are given in the file tibetan. These data comprise two groups
                             of skulls collected in Tibet. One group of 17 skulls comes from graves in
                             Sikkim and nearby areas of Tibet, and the other 15 skulls come from a
                             battlefield in Lhasa. The original data contain five measurements, but for
                             this example, we use only the fourth measurement. This is the upper face
                             height, and we round to the nearest millimeter. We use the function
                             csstemleaf, which is provided with the text.

                                load tibetan
                                % This loads up all 5 measurements of the skulls.
                                % We use the fourth characteristic to illustrate
                                % the stem-and-leaf plot. We first round them.
                                x = round(tibetan(:,4));
                                csstemleaf(x)
                                title('Height (mm) of Tibetan Skulls')
                             The resulting stem-and-leaf plot is shown in Figure 5.4. From this plot, we
                             see little evidence that there are two groups of skulls, if we look only
                             at the characteristic of upper face height. We will explore these data further
                             in Chapter 9, where we apply pattern recognition methods to the problem.


                              It is possible that we do not see much evidence for two groups of skulls
                             because there are too few stems. EDA is an iterative process, where the
                             analyst should try several visualization methods in search of patterns and
                             information in the data. An alternative approach is to plot more than one
                             line per stem. The function csstemleaf has an optional argument that allows
                             the user to specify two lines per stem. The default value is one line per
                             stem, as we saw in Example 5.3. When we plot two lines per stem, leaves that
                             correspond to the digits 0 through 4 are plotted on the first line, and those
                             that correspond to the digits 5 through 9 are shown on the second line. A
                             stem-and-leaf plot with two lines per stem for the Tibetan skull data is
                             shown in Figure 5.5. In practice,


                            © 2002 by Chapman & Hall/CRC