98 Chapter 3 Data Preprocessing 2011/6/1 3:16 Page 98 #16

Table 3.2 Stock Prices for AllElectronics and HighTech

Time point   AllElectronics   HighTech
t1           6                20
t2           5                10
t3           4                14
t4           3                5
t5           2                5

(e.g., the data follow multivariate normal distributions) does a covariance of 0 imply
independence.

Example 3.2 Covariance analysis of numeric attributes. Consider Table 3.2, which presents a
simplified example of stock prices observed at five time points for AllElectronics and
HighTech, a high-tech company. If the stocks are affected by the same industry trends,
will their prices rise or fall together?

\[
E(\text{AllElectronics}) = \frac{6 + 5 + 4 + 3 + 2}{5} = \frac{20}{5} = \$4
\]
and
\[
E(\text{HighTech}) = \frac{20 + 10 + 14 + 5 + 5}{5} = \frac{54}{5} = \$10.80.
\]
Thus, using Eq. (3.4), we compute

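\[
\begin{aligned}
\operatorname{Cov}(\text{AllElectronics}, \text{HighTech})
  &= \frac{6 \times 20 + 5 \times 10 + 4 \times 14 + 3 \times 5 + 2 \times 5}{5} - 4 \times 10.80 \\
  &= 50.2 - 43.2 = 7.
\end{aligned}
\]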

Therefore, given the positive covariance, we can say that stock prices for both companies
tend to rise together.
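
The arithmetic of this example can be checked with a short script. The following is a minimal sketch in plain Python, assuming the covariance is computed as the mean of the products minus the product of the means, as in the calculation above. The names `mean`, `covariance`, `all_electronics`, and `high_tech` are illustrative and do not come from the text.

```python
# Stock prices from Table 3.2 (five time points).
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]


def mean(values):
    """Arithmetic mean: the expected value over equally weighted observations."""
    return sum(values) / len(values)


def covariance(a, b):
    """Population covariance computed as E(A*B) - E(A)*E(B)."""
    if len(a) != len(b):
        raise ValueError("both attributes need the same number of observations")
    mean_of_products = sum(x * y for x, y in zip(a, b)) / len(a)
    return mean_of_products - mean(a) * mean(b)


cov = covariance(all_electronics, high_tech)
print(round(cov, 2))  # 7.0
```

The result is rounded before printing because floating-point subtraction can leave a tiny residue after the last decimal place.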

Variance is a special case of covariance, where the two attributes are identical (i.e., the
covariance of an attribute with itself). Variance was discussed in Chapter 2.
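
As a quick illustration of this special case, the sketch below computes the covariance of the AllElectronics column of Table 3.2 with itself and compares it with the population variance from Python's standard `statistics` module. The helper `covariance` and the variable names are assumptions made for this example only.

```python
import statistics

# AllElectronics column of Table 3.2.
prices = [6, 5, 4, 3, 2]


def covariance(a, b):
    """Population covariance computed as E(A*B) - E(A)*E(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b


# Covariance of the attribute with itself ...
self_cov = covariance(prices, prices)
# ... compared with the population variance of the same attribute.
pop_var = statistics.pvariance(prices)

print(self_cov, pop_var)
```

For this column both quantities equal 2, since E(A^2) = 90/5 = 18 and (E(A))^2 = 16.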

3.3.3 Tuple Duplication

In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case). The use of denormalized tables (often done to improve
performance by avoiding joins) is another source of data redundancy. Inconsistencies often
arise between various duplicates, due to inaccurate data entry or updating some but not
all data occurrences. For example, if a purchase order database contains attributes for
