scatter plots respectively show positively correlated data and negatively correlated data,
while Figure 2.9 displays uncorrelated data.
Note that correlation does not imply causality. That is, if A and B are correlated, this
does not necessarily imply that A causes B or that B causes A. For example, in analyzing a
demographic database, we may find that attributes representing the number of hospitals
and the number of car thefts in a region are correlated. This does not mean that one
causes the other. Both are actually causally linked to a third attribute, namely, population.
Covariance of Numeric Data
In probability theory and statistics, correlation and covariance are two similar measures
for assessing how much two attributes change together. Consider two numeric attributes
A and B, and a set of n observations $\{(a_1, b_1), \ldots, (a_n, b_n)\}$. The mean values of A and B,
respectively, are also known as the expected values on A and B, that is,
$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$$
and
$$E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}.$$
The covariance between A and B is defined as
$$\mathrm{Cov}(A,B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}. \qquad (3.4)$$
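As a rough illustration of Eq. (3.4), the sketch below computes the covariance of two attributes directly from their value lists. The attribute values and helper names are invented for this illustration, and the division by n follows Eq. (3.4) (many library routines divide by n − 1 by default).

```python
def mean(values):
    """E(X): the mean (expected value) of a list of observations."""
    return sum(values) / len(values)

def covariance(a, b):
    """Cov(A,B) per Eq. (3.4): average product of deviations from the means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)

# Invented sample values for two numeric attributes A and B.
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
print(covariance(a, b))  # 7.0 > 0: the attributes tend to rise and fall together
```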
If we compare Eq. (3.3) for $r_{A,B}$ (correlation coefficient) with Eq. (3.4) for covariance,
we see that
$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}, \qquad (3.5)$$
where $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B, respectively. It can also be
shown that
$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}. \qquad (3.6)$$
This equation may simplify calculations.
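Continuing the illustration, the following sketch uses the Eq. (3.6) shortcut and then normalizes by the two standard deviations to obtain the correlation coefficient of Eq. (3.5). The names and data are again invented; the standard deviation here divides by n so that it is consistent with Eq. (3.4).

```python
import math

# Same illustrative data as in the previous sketch (invented values).
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Cov(A,B) via the Eq. (3.6) shortcut: E(A*B) - mean(A)*mean(B)."""
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)

def std_dev(values):
    """Population standard deviation (divides by n, consistent with Eq. (3.4))."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(a, b):
    """r_{A,B} per Eq. (3.5): covariance normalized by both standard deviations."""
    return covariance(a, b) / (std_dev(a) * std_dev(b))

print(covariance(a, b))   # 7.0 -- the same value Eq. (3.4) produces
print(correlation(a, b))  # about 0.87, a fairly strong positive correlation
```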
For two attributes A and B that tend to change together, if A is larger than $\bar{A}$ (the
expected value of A), then B is likely to be larger than $\bar{B}$ (the expected value of B).
Therefore, the covariance between A and B is positive. On the other hand, if one of
the attributes tends to be above its expected value when the other attribute is below its
expected value, then the covariance of A and B is negative.
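For instance (with made-up values), take A = (1, 2, 3) and B = (6, 4, 2), so that B falls as A rises. Then $\bar{A} = 2$, $\bar{B} = 4$, and
$$\mathrm{Cov}(A,B) = \frac{(1-2)(6-4) + (2-2)(4-4) + (3-2)(2-4)}{3} = \frac{-2 + 0 - 2}{3} = -\frac{4}{3} < 0.$$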
If A and B are independent (i.e., they do not have correlation), then $E(A \cdot B) = E(A) \cdot E(B)$.
Therefore, the covariance is $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B} = E(A) \cdot E(B) - \bar{A}\bar{B} = 0$.
However, the converse is not true. Some pairs of random variables (attributes) may have
a covariance of 0 but are not independent. Only under some additional assumptions