Page 130 - Computational Statistics Handbook with MATLAB

Chapter 5: Exploratory Data Analysis

If each observation in our data consists of at least two digits, then
we can construct a stem-and-leaf diagram. To display the data, we separate each
measurement into two parts: the stem and the leaf. The stem consists
of the leading digit or digits, and the remaining digit makes up the leaf. For
example, if we had the number 75, then the stem is 7 and the leaf is 5.
If the number is 203, then the stem is 20 and the leaf is 3.
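
The split described above can be sketched in base MATLAB with integer division and the remainder. This is only an illustration for whole numbers, using the two values from the text. It is not the code of the function supplied with the book.

```matlab
% Split whole numbers into a stem (leading digits) and a leaf (last digit).
x = [75 203];            % the two numbers used in the text
stem = floor(x/10);      % drop the last digit
leaf = mod(x,10);        % keep only the last digit
disp([x' stem' leaf'])   % columns: value, stem, leaf
```

For 75 this gives stem 7 and leaf 5, and for 203 it gives stem 20 and leaf 3, matching the description above.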

The stems are listed to the left of a vertical line, with all of the leaves
corresponding to that stem listed to the right. If the data contain decimal places,
then they can be rounded for easier display. An alternative is to move the
decimal place to specify the appropriate leaf unit. We provide a function with the
text that will construct stem-and-leaf plots, and its use is illustrated in the
next example.
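
Independently of the function supplied with the text, the layout just described can be sketched in a few lines of base MATLAB. The data values below are made up for illustration, and the names leafunit and rowtext are hypothetical. The sketch moves the decimal place by dividing by the leaf unit, rounds, and then prints one text row per stem.

```matlab
% Text-only stem-and-leaf display for made-up data (not from the book).
x = [6.3 7.1 7.5 7.8 8.2 8.4 8.4 9.0 10.2];   % hypothetical values
leafunit = 0.1;                    % assumed leaf unit: each leaf is a tenth
v = sort(round(x/leafunit));       % move the decimal place, round, and sort
stems = floor(v/10);               % leading digits
leaves = mod(v,10);                % last digit
rowtext = {};                      % one text row per stem
for s = min(stems):max(stems)      % stems with no leaves still get a row
    lv = leaves(stems == s);       % leaves that belong to this stem
    rowtext{end+1} = sprintf('%3d | %s', s, sprintf('%d', lv));
end
fprintf('%s\n', rowtext{:});
fprintf('Leaf unit: %g\n', leafunit);
```

Sorting before the split keeps the leaves in increasing order within each row, which makes the shape of the distribution easier to read.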

Example 5.3
The heights of 32 Tibetan skulls [Hand, et al., 1994; Morant, 1923] measured
in millimeters are given in the file tibetan. These data comprise two groups
of skulls collected in Tibet. One group of 17 skulls comes from graves in
Sikkim and nearby areas of Tibet, and the other 15 skulls come from a battlefield
in Lhasa. The original data contain five measurements, but for this example,
we use only the fourth measurement. This is the upper face height, and we
round to the nearest millimeter. We use the function csstemleaf that is
provided with the text.

load tibetan
% This loads up all 5 measurements of the skulls.
% We use the fourth characteristic to illustrate
% the stem-and-leaf plot. We first round them.
x = round(tibetan(:,4));
csstemleaf(x)
title('Height (mm) of Tibetan Skulls')

The resulting stem-and-leaf plot is shown in Figure 5.4. From this plot, we see
there is not much evidence that there are two groups of skulls, if we look only
at the characteristic of upper face height. We will explore these data further
in Chapter 9, where we apply pattern recognition methods to the problem.

It is possible that we do not see much evidence for two groups of skulls
because there are too few stems. EDA is an iterative process, where the
analyst should try several visualization methods in search of patterns and
information in the data. An alternative approach is to plot more than one line per
stem. The function csstemleaf has an optional argument that allows the
user to specify two lines per stem. The default value is one line per stem, as
we saw in Example 5.3. When we plot two lines per stem, leaves that
correspond to the digits 0 through 4 are plotted on the first line and those that have
digits 5 through 9 are shown on the second line. A stem-and-leaf plot with two
lines per stem for the Tibetan skull data is shown in Figure 5.5. In practice,
