
09-ch02-039-082-9780123814791
                          HAN

          46    Chapter 2 Getting to Know Your Data          2011/6/1  3:15 Page 46  #8



                           Although the mean is the single most useful quantity for describing a data set, it is not
                         always the best way of measuring the center of the data. A major problem with the mean
                         is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values
                         can corrupt the mean. For example, the mean salary at a company may be substantially
                         pushed up by that of a few highly paid managers. Similarly, the mean score of a class in
                         an exam could be pulled down quite a bit by a few very low scores. To offset the effect
                         caused by a small number of extreme values, we can instead use the trimmed mean,
                         which is the mean obtained after chopping off values at the high and low extremes. For
                         example, we can sort the values observed for salary and remove the top and bottom 2%
                         before computing the mean. We should avoid trimming too large a portion (such as
                         20%) at both ends, as this can result in the loss of valuable information.
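
                           The trimmed mean described above can be sketched in a few lines of Python. This is
                         only an illustration: the function name trimmed_mean, the salary values (in thousands
                         of dollars), and the 10% trimming proportion are assumptions chosen for the example,
                         not values taken from the text.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end.

    `proportion` is a fraction in [0, 0.5); e.g. 0.02 trims the top and
    bottom 2% of the sorted values.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical salaries in $1000s, with one unusually high value (110).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

plain_mean = sum(salaries) / len(salaries)   # pulled up by the 110
trimmed = trimmed_mean(salaries, 0.10)       # drops 30 and 110
print(plain_mean, trimmed)
```

                           With only 12 values, a 2% trim rounds down to removing nothing, so the sketch uses
                         10% to make the effect visible; on a large salary table a 2% trim would remove a
                         meaningful number of extreme values.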
                           For skewed (asymmetric) data, a better measure of the center of data is the median,
                         which is the middle value in a set of ordered data values. It is the value that separates the
                         higher half of a data set from the lower half.
                           In probability and statistics, the median generally applies to numeric data; however,
                         we may extend the concept to ordinal data. Suppose that a given data set of N values
                         for an attribute X is sorted in increasing order. If N is odd, then the median is the
                         middle value of the ordered set. If N is even, then the median is not unique; it is the two
                         middlemost values and any value in between. If X is a numeric attribute in this case, by
                         convention, the median is taken as the average of the two middlemost values.


            Example 2.7 Median. Let’s find the median of the data from Example 2.6. The data are already sorted
                         in increasing order. There is an even number of observations (i.e., 12); therefore, the
                         median is not unique. It can be any value between the two middlemost values of 52 and
                         56 (that is, between the sixth and seventh values in the list). By convention, we assign the
                         average of the two middlemost values as the median; that is, \(\frac{52+56}{2} = \frac{108}{2} = 54\). Thus,
                         the median is $54,000.
                           Suppose that we had only the first 11 values in the list. Given an odd number of
                         values, the median is the middlemost value. This is the sixth value in this list, which has
                         a value of $52,000.
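
                           The odd and even cases of the median can be checked with a short Python sketch. The
                         list below is a stand-in, in thousands of dollars, chosen so that its sixth and seventh
                         values are 52 and 56 as in the example; it is an assumption and is not quoted from
                         this section. The function name median is likewise only illustrative.

```python
def median(values):
    """Median of numeric values, using the averaging convention for even N."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the single middle value.
        return ordered[mid]
    # Even N: average of the two middlemost values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Stand-in salary data in $1000s (12 values, already sorted).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

even_case = median(salaries)       # 12 values: average of 6th and 7th
odd_case = median(salaries[:11])   # first 11 values: the 6th value
print(even_case, odd_case)
```

                           Sorting dominates the cost here, which is one reason the exact median becomes
                         expensive for very large data sets.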
                           The median is expensive to compute when we have a large number of observations.
                         For numeric attributes, however, we can easily approximate the value. Assume that data
                         are grouped in intervals according to their \(x_i\) data values and that the frequency (i.e.,
                         number of data values) of each interval is known. For example, employees may be
                         grouped according to their annual salary in intervals such as $10,000–20,000, $20,000–30,000,
                         and so on. Let the interval that contains the median frequency be the median interval.
                         We can approximate the median of the entire data set (e.g., the median salary) by
                         interpolation using the formula
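                         \[
                         \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \text{width}, \tag{2.3}
                         \]
                         where \(L_1\) is the lower boundary of the median interval, \(N\) is the number of values in
                         the entire data set, \(\left(\sum \text{freq}\right)_l\) is the sum of the frequencies of all of the intervals that are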