Page 132 - Computational Statistics Handbook with MATLAB

Chapter 5: Exploratory Data Analysis                            119


                             one could plot a stem-and-leaf with one and with two lines per stem as a way
                             of discovering more about the data. The stem-and-leaf is useful in that it
                             approximates the shape of the density, and it also provides a listing of the
                             data. One can usually recover the original data set from the stem-and-leaf (if
                             it has not been rounded), unlike the histogram. A disadvantage of the stem-
                             and-leaf plot is that it is not useful for large data sets, while a histogram is
                             very effective in reducing and displaying massive data sets.



                             Quantile-Based Plots - Continuous Distributions
                             If we need to compare two distributions, then we can use the quantile plot to
                             visually compare them. This is also applicable when we want to compare a
                             distribution and a sample or to compare two samples. In comparing the dis-
                             tributions or samples, we are interested in knowing how they are shifted rel-
                             ative to each other. In essence, we want to know if they are distributed in the
                             same way. This is important when we are trying to determine the distribution
                             that generated our data, possibly with the goal of using that information to
                             generate data for Monte Carlo simulation. Another application where this is
                             useful is in checking model assumptions, such as normality, before we con-
                             duct our analysis.
                              In this part, we discuss several versions of quantile-based plots. These
                             include quantile-quantile plots (q-q plots) and quantile plots (sometimes
                             called probability plots). Quantile plots for discrete data are discussed
                             next. The quantile plot is used to compare a sample with a theoretical
                             distribution. Typically, a q-q plot (sometimes called an empirical quantile
                             plot) is used to determine whether two random samples are generated by the
                             same distribution. The q-q plot can also be used to compare a random sample
                             with a theoretical distribution by generating a sample from the theoretical
                             distribution as the second sample.
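
As a rough illustration of that last point, the sketch below compares a sample with a theoretical distribution by simulating a second sample of the same size and plotting the sorted values against each other. The data, the sample size, and the choice of a standard normal reference are all assumptions made for illustration; they do not come from the text.

```matlab
% Sketch: compare a sample with a theoretical distribution by simulating
% a second sample from that distribution. All data here are synthetic.
n = 100;
x = 2 + 0.5*randn(n,1);    % hypothetical "observed" sample
y = randn(n,1);            % second sample drawn from the standard normal

xs = sort(x);              % sorted observed sample
ys = sort(y);              % sorted simulated sample

% The samples have equal size, so the sorted values can be paired directly.
plot(ys, xs, 'o')
xlabel('Simulated standard normal quantiles')
ylabel('Sample quantiles')
title('Q-Q plot: sample versus simulated reference')
```

If the points fall close to a straight line, the two samples plausibly come from the same family of distributions up to a shift and a change of scale. Because the reference sample is itself random, the picture changes somewhat from run to run.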


                             Q-Q Plot
                             The q-q plot was originally proposed by Wilk and Gnanadesikan [1968] to
                             visually compare two distributions by graphing the quantiles of one versus
                             the quantiles of the other. Say we have two data sets consisting of univariate
                             measurements. We denote the order statistics for the first data set by
                                                       x_{(1)}, x_{(2)}, \ldots, x_{(n)}.
                             Let the order statistics for the second data set be

                                                       y_{(1)}, y_{(2)}, \ldots, y_{(m)},
                             with m ≤ n.
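
The order statistics are simply the sorted data, so they can be obtained with `sort`. The sketch below uses two hypothetical data sets of unequal size and relabels them if necessary so that m ≤ n holds. To pair the two sets of quantiles it assumes that the i-th order statistic of a sample of size n estimates the (i - 0.5)/n quantile, and it linearly interpolates the larger sample at the probability levels of the smaller one. Both the plotting position and the interpolation are assumptions of this sketch, not something stated in the text above.

```matlab
% Hypothetical data sets with unequal sizes (synthetic, for illustration).
x = randn(75,1);           % first data set
y = 1 + randn(50,1);       % second data set

% Relabel if needed so that x is the larger set, matching m <= n.
if numel(y) > numel(x)
    [x, y] = deal(y, x);
end
n = numel(x);
m = numel(y);

xs = sort(x);              % x_(1) <= x_(2) <= ... <= x_(n)
ys = sort(y);              % y_(1) <= y_(2) <= ... <= y_(m)

% Assumed convention: the i-th order statistic estimates the (i-0.5)/n
% quantile. Interpolate the larger sample at the m probability levels
% that belong to the smaller sample.
px = ((1:n)' - 0.5) / n;
py = ((1:m)' - 0.5) / m;
xq = interp1(px, xs, py);  % m interpolated quantiles of x

plot(xq, ys, 'o')
xlabel('Quantiles of x')
ylabel('Quantiles of y')
```

Because m ≤ n, every level in `py` lies between the smallest and largest level in `px`, so `interp1` never has to extrapolate. When m = n the two sets of levels coincide and the plot reduces to pairing the order statistics directly.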



                            © 2002 by Chapman & Hall/CRC