                               2.2 Basic Statistical Descriptions of Data


                                 The quartiles give an indication of a distribution’s center, spread, and shape. The first
                               quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.
                               The third quartile, denoted by Q3, is the 75th percentile; it cuts off the lowest 75% (or
                               highest 25%) of the data. The second quartile, Q2, is the 50th percentile. As the median,
                               it gives the center of the data distribution.
                                 The distance between the first and third quartiles is a simple measure of spread
                               that gives the range covered by the middle half of the data. This distance is called the
                               interquartile range (IQR) and is defined as

                                                           IQR = Q3 − Q1.                    (2.5)

                 Example 2.10 Interquartile range. The quartiles are the three values that split the sorted data set into
                               four equal parts. The data of Example 2.6 contain 12 observations, already sorted in
                               increasing order. Thus, the quartiles for these data are the third, sixth, and ninth
                               values, respectively, in the sorted list. Therefore, Q1 = $47,000 and Q3 = $63,000. Thus,
                               the interquartile range is IQR = $63,000 − $47,000 = $16,000. (Note that the sixth value,
                               $52,000, is a median, although this data set has two medians because the number of data
                               values is even.)
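
The positional rule used in Example 2.10 can be sketched in Python. This is a minimal illustration, not a general quartile routine: it assumes the number of observations is divisible by four, and the function name `positional_quartiles` is hypothetical. The salary list (in thousands of dollars) is assumed to be the data of Example 2.6, which is not reproduced in this section; only its third, sixth, and ninth values are confirmed by the example above.

```python
# Salaries in thousands of dollars, sorted in increasing order.
# ASSUMPTION: taken to be the 12 values of Example 2.6.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def positional_quartiles(sorted_values):
    """Return (Q1, Q2, Q3) as the values at 1-based positions n/4, n/2, 3n/4.

    Mirrors Example 2.10 only; requires len(sorted_values) divisible by 4.
    """
    n = len(sorted_values)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a nonempty list whose length is a multiple of 4")
    step = n // 4
    # Convert the 1-based positions step, 2*step, 3*step to 0-based indices.
    return tuple(sorted_values[k * step - 1] for k in (1, 2, 3))


q1, q2, q3 = positional_quartiles(salaries)
iqr = q3 - q1  # Equation (2.5)
print(q1, q2, q3, iqr)  # prints: 47 52 63 16
```

Statistical libraries often interpolate between neighboring values, so their quartiles for the same data may differ from these positional ones.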



                               Five-Number Summary, Boxplots, and Outliers
                               No single numeric measure of spread (e.g., IQR) is very useful for describing skewed
                               distributions. Consider the symmetric and skewed data distributions of Figure 2.1.
                               In the symmetric distribution, the median (and other measures of central tendency)
                               splits the data into equal-size halves. This does not occur for skewed distributions.
                               Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along
                               with the median. A common rule of thumb for identifying suspected outliers is to
                               single out values falling at least 1.5 × IQR above the third quartile or below the first
                               quartile.
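
The 1.5 × IQR rule of thumb can be written as a short filter. The sketch below is one possible reading of the rule, with Q1 and Q3 supplied by the caller. The function name `suspected_outliers`, the parameter `k`, and the salary list (assumed to be the data of Example 2.6, in thousands of dollars) are illustrative and do not come from the text.

```python
def suspected_outliers(values, q1, q3, k=1.5):
    """Return values lying at least k * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # "At least 1.5 x IQR away" is read as inclusive of the fences themselves.
    return [v for v in values if v <= lower_fence or v >= upper_fence]


# ASSUMPTION: the 12 salary values of Example 2.6, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# With Q1 = 47 and Q3 = 63, the fences are 47 - 24 = 23 and 63 + 24 = 87.
print(suspected_outliers(salaries, q1=47, q3=63))  # prints: [110]
```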
                                 Because Q1, the median, and Q3 together contain no information about the endpoints
                               (e.g., tails) of the data, a fuller summary of the shape of a distribution can be
                               obtained by providing the lowest and highest data values as well. This is known as
                               the five-number summary. The five-number summary of a distribution consists of the
                               median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual
                               observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
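
As a rough illustration, the five-number summary can be assembled as follows. The sketch reuses the positional quartile convention of Example 2.10, so it assumes the number of observations is a multiple of four and reports the value at position n/2 as the median. The function name `five_number_summary` is hypothetical, and the salary list is assumed to be the data of Example 2.6 (in thousands of dollars).

```python
def five_number_summary(values):
    """Return (Minimum, Q1, Median, Q3, Maximum) for the given observations.

    Quartiles are picked by position (n/4, n/2, 3n/4, 1-based), as in
    Example 2.10, so len(values) must be a nonzero multiple of 4.
    """
    data = sorted(values)
    n = len(data)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch needs a nonempty list whose length is a multiple of 4")
    step = n // 4
    q1, median, q3 = (data[k * step - 1] for k in (1, 2, 3))
    return data[0], q1, median, q3, data[-1]


# ASSUMPTION: the 12 salary values of Example 2.6, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))  # prints: (30, 47, 52, 63, 110)
```

Under a convention that averages the two middle values, the median entry would differ; the other four entries would be unaffected.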
                                 Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the
                               five-number summary as follows:


                                 • Typically, the ends of the box are at the quartiles so that the box length is the
                                   interquartile range.
                                 • The median is marked by a line within the box.
                                 • Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
                                   largest (Maximum) observations.
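
The three rules above can be mimicked with a one-line text rendering. This is only a rough sketch for intuition, not a substitute for a plotting library; the function name `text_boxplot`, its `width` parameter, and the characters used for the box and whiskers are all arbitrary choices, and the sample summary values assume the data of Example 2.6 (in thousands of dollars).

```python
def text_boxplot(minimum, q1, median, q3, maximum, width=61):
    """Draw a one-line text boxplot from a five-number summary."""
    if not minimum <= q1 <= median <= q3 <= maximum:
        raise ValueError("summary values must be in nondecreasing order")
    span = maximum - minimum
    if span == 0 or width < 2:
        raise ValueError("need a nonzero data range and width >= 2")

    def col(x):
        # Map a data value to a character column in [0, width - 1].
        return round((x - minimum) * (width - 1) / span)

    row = ["-"] * width                     # whiskers run from Minimum to Maximum
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                        # box body covers the interquartile range
    row[col(q1)], row[col(q3)] = "[", "]"   # box ends sit at the quartiles
    row[col(median)] = "M"                  # median mark inside the box
    row[0], row[-1] = "|", "|"              # whisker ends at the extreme observations
    return "".join(row)


# With width=81 and a data range of 80, each column is one unit (here $1,000).
print(text_boxplot(30, 47, 52, 63, 110, width=81))
```

The long right-hand whisker makes the skew of the sample visible at a glance.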