scatter plots respectively show positively correlated data and negatively correlated data, while Figure 2.9 displays uncorrelated data.

Note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population.

Covariance of Numeric Data

In probability theory and statistics, correlation and covariance are two similar measures for assessing how much two attributes change together. Consider two numeric attributes A and B, and a set of n observations {(a_1, b_1), ..., (a_n, b_n)}. The mean values of A and B, respectively, are also known as the expected values on A and B, that is,

    E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}

and

    E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}.
The covariance between A and B is defined as

    Cov(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}.        (3.4)
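
As a concrete check on Eq. (3.4), the covariance can be computed directly from this definition. The following is a minimal Python sketch; the observation lists a and b are hypothetical values chosen only for illustration.

    def covariance(a, b):
        """Population covariance of two numeric attributes, per Eq. (3.4)."""
        n = len(a)
        mean_a = sum(a) / n               # E(A)
        mean_b = sum(b) / n               # E(B)
        return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

    # Hypothetical observations of attributes A and B
    a = [2.0, 3.0, 5.0, 4.0, 6.0]
    b = [5.0, 8.0, 10.0, 11.0, 14.0]
    print(covariance(a, b))               # positive: A and B tend to rise together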
If we compare Eq. (3.3) for r_{A,B} (correlation coefficient) with Eq. (3.4) for covariance, we see that

    r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B},        (3.5)
where \sigma_A and \sigma_B are the standard deviations of A and B, respectively. It can also be shown that

    Cov(A, B) = E(A \cdot B) - \bar{A}\,\bar{B}.        (3.6)

This equation may simplify calculations.
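
Both identities can be verified numerically. The sketch below is a rough illustration, again with hypothetical data: it computes Cov(A, B) via the shortcut of Eq. (3.6) and then r_{A,B} via Eq. (3.5), using population standard deviations to match the 1/n convention of Eq. (3.4).

    import math

    def covariance_shortcut(a, b):
        """Cov(A, B) = E(A*B) - mean(A)*mean(B), per Eq. (3.6)."""
        n = len(a)
        mean_ab = sum(x * y for x, y in zip(a, b)) / n    # E(A * B)
        return mean_ab - (sum(a) / n) * (sum(b) / n)

    def correlation(a, b):
        """r_{A,B} = Cov(A, B) / (sigma_A * sigma_B), per Eq. (3.5)."""
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)   # population std. dev.
        sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
        return covariance_shortcut(a, b) / (sigma_a * sigma_b)

    # Same hypothetical data as in the previous sketch
    a = [2.0, 3.0, 5.0, 4.0, 6.0]
    b = [5.0, 8.0, 10.0, 11.0, 14.0]
    print(covariance_shortcut(a, b))   # matches the definition-based value from Eq. (3.4)
    print(correlation(a, b))           # a value in [-1, 1]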
For two attributes A and B that tend to change together, if A is larger than \bar{A} (the expected value of A), then B is likely to be larger than \bar{B} (the expected value of B). Therefore, the covariance between A and B is positive. On the other hand, if one of the attributes tends to be above its expected value when the other attribute is below its expected value, then the covariance of A and B is negative.
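
To make the sign behavior concrete, the small hypothetical sketch below contrasts a pair of attributes whose values rise together with a pair whose values move in opposite directions (it re-declares the covariance helper from the earlier sketch so that it can run on its own).

    def covariance(a, b):
        """Population covariance, per Eq. (3.4)."""
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

    rise_together = ([1, 2, 3, 4, 5], [2, 4, 5, 4, 7])     # above/below their means together
    move_oppositely = ([1, 2, 3, 4, 5], [9, 7, 6, 4, 2])   # one is high when the other is low

    print(covariance(*rise_together))     # positive (2.0)
    print(covariance(*move_oppositely))   # negative (-3.4)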
If A and B are independent (i.e., they do not have correlation), then E(A \cdot B) = E(A) \cdot E(B). Therefore, the covariance is

    Cov(A, B) = E(A \cdot B) - \bar{A}\,\bar{B} = E(A) \cdot E(B) - \bar{A}\,\bar{B} = 0.

However, the converse is not true. Some pairs of random variables (attributes) may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
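
A standard way to see that the converse fails: let the values of A be symmetric about 0 and let B = A^2. Then B is completely determined by A, so the two attributes are certainly not independent, yet their covariance is 0. The sketch below is a hypothetical illustration of that case.

    def covariance(a, b):
        """Population covariance, per Eq. (3.4)."""
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

    a = [-2.0, -1.0, 0.0, 1.0, 2.0]     # symmetric about 0
    b = [x ** 2 for x in a]             # B is a deterministic function of A

    print(covariance(a, b))             # 0.0, even though A and B are clearly dependent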