Page 146 - Computational Statistics Handbook with MATLAB
Chapter 5: Exploratory Data Analysis
Two limits are also defined: a lower limit (LL) and an upper limit (UL). These
are calculated from the estimated IQR as follows
$$
\begin{aligned}
LL &= \hat{q}_{0.25} - 1.5 \cdot \widehat{IQR}, \\
UL &= \hat{q}_{0.75} + 1.5 \cdot \widehat{IQR}.
\end{aligned}
\tag{5.6}
$$
The idea is that observations that lie outside these limits are possible outliers.
Outliers are data points that lie away from the rest of the data. This might
mean that the data were incorrectly measured or recorded. On the other
hand, it could mean that they represent extreme points that arise naturally
according to the distribution. In any event, they are sample points that are
suitable for further investigation.
Adjacent values are the most extreme observations in the data set that are
within the lower and the upper limits. If there are no potential outliers, then
the adjacent values are simply the maximum and the minimum data points.
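The limits, possible outliers, and adjacent values described above can be computed directly. The following is a minimal sketch, not the book's own code: the data vector x is hypothetical, and the quartiles are estimated by linear interpolation of the order statistics at p_j = (j - 0.5)/n, which is an assumption about the quantile estimator (other definitions give slightly different values).

```matlab
% Hypothetical sample with one unusually large value.
x = [2 4 4 5 6 7 8 9 10 30]';

% Estimate the quartiles by interpolating the order statistics.
% Assumption: x_(j) estimates the quantile at p_j = (j - 0.5)/n.
xs = sort(x);
n = length(xs);
p = ((1:n)' - 0.5)/n;
q = interp1(p, xs, [0.25 0.5 0.75]);

% Estimated interquartile range and the limits of Equation 5.6.
iqrHat = q(3) - q(1);
LL = q(1) - 1.5*iqrHat;
UL = q(3) + 1.5*iqrHat;

% Possible outliers lie outside the limits.
outliers = xs(xs < LL | xs > UL);

% Adjacent values: most extreme observations within the limits.
inside = xs(xs >= LL & xs <= UL);
adjLow = min(inside);
adjHigh = max(inside);
```

For this sample the quartile estimates are 4 and 9, so the limits are LL = -3.5 and UL = 16.5. The value 30 is flagged as a possible outlier, and the adjacent values are 2 and 10.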
To construct a box plot, we place horizontal lines at each of the three quar-
tiles and draw vertical lines to create a box. We then extend a line from the
first quartile to the smallest adjacent value and do the same for the third quar-
tile and largest adjacent value. These lines are sometimes called the whiskers.
Finally, any possible outliers are shown as an asterisk or some other plotting
symbol. An example of a box plot is shown in Figure 5.14.
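The construction steps above can be sketched with basic plotting commands. This is only an illustration under stated assumptions: the summary values q, adj, and out are hypothetical, and the box position xc and half-width hw are arbitrary choices, not values from the book.

```matlab
% Hypothetical summary values for one sample (not from the book).
q = [4 6.5 9];      % first quartile, median, third quartile
adj = [2 10];       % smallest and largest adjacent values
out = 30;           % possible outliers
xc = 1;             % horizontal position of the box center
hw = 0.25;          % half-width of the box

figure
hold on
% Horizontal lines at each of the three quartiles.
for k = 1:3
    plot([xc-hw xc+hw], [q(k) q(k)], 'k-')
end
% Vertical lines that close the box.
plot([xc-hw xc-hw], [q(1) q(3)], 'k-')
plot([xc+hw xc+hw], [q(1) q(3)], 'k-')
% Whiskers from the quartiles to the adjacent values.
plot([xc xc], [adj(1) q(1)], 'k-')
plot([xc xc], [q(3) adj(2)], 'k-')
% Possible outliers plotted as asterisks.
plot(xc*ones(size(out)), out, 'k*')
hold off
axis([0 2 0 32])
```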
Box plots for different samples can be plotted together for visually compar-
ing the corresponding distributions. The MATLAB Statistics Toolbox con-
tains a function called boxplot for creating this type of display. It displays
one box plot for each column of data. When we want to compare data sets, it
is better to display a box plot with notches. These notches represent the
uncertainty in the estimated medians and provide a rough measure
of the significance of the differences between them. If the notches of two
boxes do not overlap, then there is evidence that the corresponding medians
are significantly different.
The length of the whisker is easily adjusted using optional input arguments
to boxplot. For more information on this function and to find out what
other options are available, type help boxplot at the MATLAB command
line.
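A side-by-side notched display might be requested as in the sketch below. The data matrix X is hypothetical, and the name-value arguments 'notch' and 'whisker' are an assumption about the installed version of the Statistics Toolbox; older releases take these options as positional arguments, so check help boxplot first.

```matlab
% Hypothetical data: two samples of size 50 stored as columns.
X = [randn(50,1), randn(50,1) + 1];

% Draw one notched box plot per column of X.
% Assumption: the toolbox accepts name-value arguments.
% 'whisker' sets the whisker length as a multiple of the IQR;
% 1.5 matches the limits in Equation 5.6.
boxplot(X, 'notch', 'on', 'whisker', 1.5)
```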
Example 5.10
In this example, we first generate random variables from a uniform distribution
on the interval (0, 1), a standard normal distribution, and an exponential
distribution. We will then display the box plots corresponding to each
sample using the MATLAB function boxplot.
% Generate a sample from the uniform distribution.
xunif = rand(100,1);
% Generate a sample from the standard normal distribution.
xnorm = randn(100,1);
% Generate a sample from the exponential distribution.
© 2002 by Chapman & Hall/CRC