

          Chapter 2 Getting to Know Your Data



                         to assess both the overall behavior and unusual occurrences). Second, it plots quantile
                         information (see Section 2.2.2). Let $x_i$, for $i = 1$ to $N$, be the data sorted in increasing
                         order so that $x_1$ is the smallest observation and $x_N$ is the largest for some ordinal or
                         numeric attribute $X$. Each observation, $x_i$, is paired with a fraction, $f_i$, which indicates
                         that approximately $f_i \times 100\%$ of the data are below the value, $x_i$. We say “approximately”
                         because there may not be a value with exactly a fraction, $f_i$, of the data below $x_i$. Note
                         that the 0.25 quantile (the 25th percentile) corresponds to quartile $Q_1$, the 0.50 quantile
                         is the median, and the 0.75 quantile is $Q_3$.
                           Let
                         $$f_i = \frac{i - 0.5}{N}. \tag{2.7}$$
                         These numbers increase in equal steps of $1/N$, ranging from $\frac{1}{2N}$ (which is slightly
                         above 0) to $1 - \frac{1}{2N}$ (which is slightly below 1). On a quantile plot, $x_i$ is graphed
                         against $f_i$. This allows us to compare different distributions based on their quantiles. For
                         example, given the quantile plots of sales data for two different time periods, we can compare
                         their $Q_1$, median, $Q_3$, and other $f_i$ values at a glance.
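
The pairing of sorted observations with the f-values of Eq. (2.7) can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the function name `quantile_plot_points` and the sample prices are assumptions made up for the example, and the prices are not the values of Table 2.1.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot, with f_i = (i - 0.5) / N."""
    xs = sorted(values)  # x_1 <= x_2 <= ... <= x_N
    n = len(xs)
    if n == 0:
        return []
    # enumerate from 1 so that i matches the 1-based index used in Eq. (2.7)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


# Hypothetical unit prices, used only for illustration.
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
points = quantile_plot_points(prices)

for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting the second element of each pair against the first gives the quantile plot. The f-values start at 1/(2N) and end at 1 − 1/(2N), so no observation is assigned exactly 0 or 1.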

           Example 2.13 Quantile plot. Figure 2.4 shows a quantile plot for the unit price data of Table 2.1.

                         Quantile–Quantile Plot

                         A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
                         against the corresponding quantiles of another. It is a powerful visualization tool in that it
                         allows the user to view whether there is a shift in going from one distribution to another.
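
As a sketch of the simplest case, in which both samples contain the same number of observations, the q-q points can be formed by sorting each sample and pairing values of equal rank. The function name `qq_points_equal_size` and the branch data below are hypothetical, and the sketch deliberately rejects samples of unequal size rather than guess at an interpolation rule.

```python
def qq_points_equal_size(first, second):
    """Return (f_i, x_i, y_i) triples for two samples of equal size N.

    x_i and y_i are the i-th smallest values of each sample, so both are
    the (i - 0.5) / N quantile of their own data set.
    """
    xs = sorted(first)
    ys = sorted(second)
    if len(xs) != len(ys):
        # Assumption of this sketch: unequal sizes are out of scope here.
        raise ValueError("this sketch only handles samples of equal size")
    n = len(xs)
    return [((i - 0.5) / n, x, y)
            for i, (x, y) in enumerate(zip(xs, ys), start=1)]


# Hypothetical unit prices from two branches, used only for illustration.
branch_1 = [47, 40, 74, 43]
branch_2 = [52, 80, 45, 50]

for f, x, y in qq_points_equal_size(branch_1, branch_2):
    print(f"f = {f:.3f}: branch 1 = {x}, branch 2 = {y}")
```

Plotting each y value against its paired x value gives the q-q plot. Points lying above the line y = x indicate that the second sample has the larger value at that quantile.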
                           Suppose that we have two sets of observations for the attribute or variable unit price,
                         taken from two different branch locations. Let $x_1, \ldots, x_N$ be the data from the first
                         branch, and $y_1, \ldots, y_M$ be the data from the second, where each data set is sorted in
                         increasing order. If $M = N$ (i.e., the number of points in each set is the same), then we
                         simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both $(i - 0.5)/N$ quantiles of their
                         respective data sets. If $M < N$ (i.e., the second branch has fewer observations than the first),
                         there can be only $M$ points on the q-q plot. Here, $y_i$ is the $(i - 0.5)/M$ quantile of the y


               [Plot for Figure 2.4: unit price ($), from 0 to 140, on the vertical axis against f-value, from 0.00 to 1.00, on the horizontal axis, with Q1, the median, and Q3 marked.]


               Figure 2.4 A quantile plot for the unit price data of Table 2.1.