Page 146 - Computational Statistics Handbook with MATLAB

Chapter 5: Exploratory Data Analysis


                             Two limits are also defined: a lower limit (LL) and an upper limit (UL). These
                             are calculated from the estimated IQR as follows:

                             \[
                             \begin{aligned}
                             LL &= \hat{q}_{0.25} - 1.5 \cdot \widehat{IQR} \\
                             UL &= \hat{q}_{0.75} + 1.5 \cdot \widehat{IQR}.
                             \end{aligned}
                             \tag{5.6}
                             \]
                             The idea is that observations that lie outside these limits are possible outliers.
                             Outliers are data points that lie away from the rest of the data. This might
                             mean that the data were incorrectly measured or recorded. On the other
                             hand, it could mean that they represent extreme points that arise naturally
                             according to the distribution. In any event, they are sample points that are
                             suitable for further investigation.
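
As a minimal sketch of how the limits can be computed, the following code finds LL and UL for a small made-up sample and flags the observations that fall outside them. The data are purely illustrative. The quartile estimate is an assumption: it treats the j-th order statistic as the (j - 0.5)/n quantile and interpolates linearly in between; other quartile conventions give slightly different limits.

```matlab
% Small made-up sample with one suspiciously large value (illustrative only).
x = [2 4 5 5 6 7 8 9 10 30]';
n = length(x);
xs = sort(x);
% Assumed quartile estimate: the j-th order statistic is taken to be the
% (j - 0.5)/n quantile, with linear interpolation between order statistics.
p = ((1:n)' - 0.5)/n;
q = interp1(p, xs, [0.25 0.75]);
iqrhat = q(2) - q(1);            % estimated interquartile range
LL = q(1) - 1.5*iqrhat;          % lower limit
UL = q(2) + 1.5*iqrhat;          % upper limit
% Observations outside the limits are possible outliers.
outliers = x(x < LL | x > UL);
```

For this sample only the value 30 lies outside the limits, so it is the one point we would mark for further investigation.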
                              Adjacent values are the most extreme observations in the data set that are
                             within the lower and the upper limits. If there are no potential outliers, then
                             the adjacent values are simply the maximum and the minimum data points.
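
The adjacent values can be found in the same way. The sketch below reuses the illustrative sample and the assumed quartile estimate (order statistics as the (j - 0.5)/n quantiles, linearly interpolated). Treating an observation that falls exactly on a limit as being within the limits is also an assumption.

```matlab
% Illustrative sample; the value 30 is a possible outlier.
x = [2 4 5 5 6 7 8 9 10 30]';
n = length(x);
xs = sort(x);
% Assumed quartile estimate (see the lead-in); other conventions exist.
p = ((1:n)' - 0.5)/n;
q = interp1(p, xs, [0.25 0.75]);
LL = q(1) - 1.5*(q(2) - q(1));
UL = q(2) + 1.5*(q(2) - q(1));
% Keep the observations within the limits (boundary points counted as inside).
inside = xs(xs >= LL & xs <= UL);
adjLow  = min(inside);           % smallest adjacent value
adjHigh = max(inside);           % largest adjacent value
```

Here the largest adjacent value is the largest observation that is not flagged, rather than the sample maximum.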
                              To construct a box plot, we place horizontal lines at each of the three quar-
                             tiles and draw vertical lines to create a box. We then extend a line from the
                             first quartile to the smallest adjacent value and do the same for the third quar-
                             tile and largest adjacent value. These lines are sometimes called the whiskers.
                             Finally, any possible outliers are shown as an asterisk or some other plotting
                             symbol. An example of a box plot is shown in Figure 5.14.
                              Box plots for different samples can be plotted together for visually compar-
                             ing the corresponding distributions. The MATLAB Statistics Toolbox con-
                             tains a function called boxplot for creating this type of display. It displays
                             one box plot for each column of data. When we want to compare data sets, it
                             is better to display a box plot with notches. These notches represent the
                             uncertainty in the estimated medians and provide a rough measure of the
                             significance of the differences between them. If the notches of two boxes
                             do not overlap, then there is evidence that the corresponding medians are
                             significantly different.
                             The length of the whisker is easily adjusted using optional input arguments
                             to boxplot. For more information on this function and to find out what
                             other options are available, type help boxplot at the MATLAB command
                             line.
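
To make the notch comparison concrete, the sketch below computes notch intervals for two made-up samples and checks whether they overlap. Two assumptions are not from the text above: the half-width 1.57*IQR/sqrt(n) is the one quoted in the MATLAB documentation for boxplot, and the quartiles are estimated by treating the j-th order statistic as the (j - 0.5)/n quantile. The commented call at the end shows the name-value form of boxplot in recent Statistics Toolbox releases; the argument syntax may differ in older versions.

```matlab
% Two made-up samples stored as columns (illustrative data only).
X = [(1:20)', (11:30)'];
n = size(X,1);
Xs = sort(X);                            % sorts each column separately
% Assumed quantile estimate, applied column by column.
p = ((1:n)' - 0.5)/n;
Q = interp1(p, Xs, [0.25 0.50 0.75]);    % rows: q25, median, q75
% Notch half-width as quoted in the boxplot documentation (assumption).
halfw = 1.57*(Q(3,:) - Q(1,:))/sqrt(n);
notchLo = Q(2,:) - halfw;
notchHi = Q(2,:) + halfw;
% The notches overlap if each interval starts before the other one ends.
overlap = notchLo(2) <= notchHi(1) && notchLo(1) <= notchHi(2);
% With the Statistics Toolbox, the corresponding display would be:
% boxplot(X,'notch','on','whisker',1.5)
```

In this example the notches do not overlap, which we would read as evidence that the two medians differ.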


                             Example 5.10
                             In this example, we first generate random variables from a uniform
                             distribution on the interval (0, 1), a standard normal distribution, and an
                             exponential distribution. We will then display the box plots corresponding
                             to each sample using the MATLAB function boxplot.

                                % Generate a sample from the uniform distribution.
                                xunif = rand(100,1);
                                % Generate sample from the standard normal.
                                xnorm = randn(100,1);
                                % Generate a sample from the exponential distribution.

                            © 2002 by Chapman & Hall/CRC