                                                              2.2 Basic Statistical Descriptions of Data  51


                                 The variance of $N$ observations, $x_1, x_2, \ldots, x_N$, for a numeric attribute $X$ is
                                 \[
                                 \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2
                                          = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2, \tag{2.6}
                                 \]
                               where $\bar{x}$ is the mean value of the observations, as defined in Eq. (2.1). The standard
                               deviation, $\sigma$, of the observations is the square root of the variance, $\sigma^2$.
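
The second equality in Eq. (2.6) follows from expanding the square and applying the definition of the mean. The intermediate steps below are filled in for the reader and are not spelled out in the original text:

```latex
\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2
     \;-\; 2\bar{x}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i
     \;+\; \frac{1}{N}\sum_{i=1}^{N}\bar{x}^2 \\
  &= \frac{1}{N}\sum_{i=1}^{N} x_i^2 \;-\; 2\bar{x}^2 \;+\; \bar{x}^2
     \qquad \text{since } \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \\
  &= \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 .
\end{aligned}
```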
                 Example 2.12 Variance and standard deviation. In Example 2.6, we found $\bar{x} = \$58{,}000$ using Eq. (2.1)
                               for the mean. To determine the variance and standard deviation of the data from that
                               example, we set $N = 12$ and, working in thousands of dollars, use Eq. (2.6) to obtain
                               \[
                               \sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17
                               \]
                               \[
                               \sigma \approx \sqrt{379.17} \approx 19.47.
                               \]
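
The arithmetic in Example 2.12 can be checked with a short Python sketch that evaluates both forms of Eq. (2.6). The twelve salary values are an assumption: they are recalled from Example 2.6, which is not reproduced in this section, and are used here because they give a mean of 58 (in thousands of dollars). The function names are illustrative only.

```python
import math

# ASSUMPTION: salary values in thousands of dollars, recalled from Example 2.6
# (that example is not shown in this section).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def variance_definition(xs):
    """First form of Eq. (2.6): mean squared deviation from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_shortcut(xs):
    """Second form of Eq. (2.6): mean of the squares minus the squared mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean ** 2


var_def = variance_definition(salaries)
var_short = variance_shortcut(salaries)
sigma = math.sqrt(var_def)

print(round(var_def, 2), round(var_short, 2), round(sigma, 2))
# prints: 379.17 379.17 19.47
```

Both forms divide by N, matching Eq. (2.6), so the result is the population variance rather than the sample variance that divides by N - 1.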

                                 The basic properties of the standard deviation, σ, as a measure of spread are as
                               follows:

                                 σ measures spread about the mean and should be considered only when the mean is
                                 chosen as the measure of center.
                                 σ = 0 only when there is no spread, that is, when all observations have the same
                                 value. Otherwise, σ > 0.

                                 Importantly, an observation is unlikely to be more than several standard deviations
                               away from the mean. Mathematically, using Chebyshev's inequality, it can be shown that
                               at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations are no more than
                               $k$ standard deviations from the mean. Therefore, the standard deviation is a good
                               indicator of the spread of a data set.
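
The Chebyshev bound can be checked empirically. The sketch below uses a small hypothetical data set (not taken from the text) and compares the observed fraction of values within k standard deviations of the mean against the guaranteed minimum 1 - 1/k^2. The helper name `fraction_within` is illustrative.

```python
import math


def fraction_within(xs, k):
    """Fraction of observations no more than k standard deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)  # divide by N, as in Eq. (2.6)
    return sum(1 for x in xs if abs(x - mean) <= k * sigma) / n


# Hypothetical data, chosen so that the mean is 5 and sigma is 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2          # Chebyshev lower bound
    observed = fraction_within(data, k)
    print(f"k={k}: observed {observed:.3f} >= bound {bound:.3f}")
    assert observed >= bound
```

The bound holds for any data set but is often loose: here every value lies within two standard deviations, while the guarantee for k = 2 is only 75%.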
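                                 The computation of the variance and standard deviation is scalable in large databases:
                               by the second form of Eq. (2.6), only $N$, $\sum x_i$, and $\sum x_i^2$ are needed, and
                               each can be accumulated in a single pass over the data.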
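
One way to see why this computation scales is that the second form of Eq. (2.6) depends only on three running totals, which can be computed per partition and then added together. The following is a minimal sketch under that assumption. The `Summary` class, the `summarize` helper, and the partitioned data are all hypothetical and are not taken from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """Running totals sufficient to evaluate the second form of Eq. (2.6)."""
    n: int = 0
    total: float = 0.0      # sum of x_i
    total_sq: float = 0.0   # sum of x_i^2

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        # Totals from disjoint partitions simply add.
        return Summary(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    def variance(self):
        mean = self.total / self.n
        return self.total_sq / self.n - mean ** 2


def summarize(chunk):
    """Single pass over one partition."""
    s = Summary()
    for x in chunk:
        s.add(x)
    return s


# Hypothetical data split across three partitions.
partitions = [[2, 4, 4], [4, 5], [5, 7, 9]]

combined = Summary()
for part in partitions:
    combined = combined.merge(summarize(part))

print(combined.variance(), math.sqrt(combined.variance()))
# prints: 4.0 2.0
```

The sum-of-squares form can lose floating-point precision when the mean is large relative to the spread. A numerically stabler single-pass alternative is Welford's algorithm, which is a different technique from the one sketched here.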


                         2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
                               In this section, we study graphic displays of basic statistical descriptions. These include
                               quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are
                               helpful for the visual inspection of data, which is useful for data preprocessing. The first
                               three of these show univariate distributions (i.e., data for one attribute), while scatter
                               plots show bivariate distributions (i.e., involving two attributes).

                               Quantile Plot
                               In this and the following subsections, we cover common graphic displays of data
                               distributions. A quantile plot is a simple and effective way to have a first look at a univariate
                               data distribution. First, it displays all of the data for the given attribute (allowing the user