3.3 Data Integration

$\chi^2$ Correlation Test for Nominal Data
For nominal data, a correlation relationship between two attributes, A and B, can be
discovered by a $\chi^2$ (chi-square) test. Suppose A has c distinct values, namely
$a_1, a_2, \ldots, a_c$, and B has r distinct values, namely $b_1, b_2, \ldots, b_r$.
The data tuples described by A and B can be shown as a contingency table, with the c
values of A making up the columns and the r values of B making up the rows. Let
$(A_i, B_j)$ denote the joint event that attribute A takes on value $a_i$ and attribute B
takes on value $b_j$, that is, where $(A = a_i, B = b_j)$. Every possible $(A_i, B_j)$
joint event has its own cell (or slot) in the table. The $\chi^2$ value (also known as
the Pearson $\chi^2$ statistic) is computed as
\[
\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \tag{3.1}
\]
where $o_{ij}$ is the observed frequency (i.e., actual count) of the joint event
$(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, which can be
computed as
\[
e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}, \tag{3.2}
\]
where n is the number of data tuples, $\mathrm{count}(A = a_i)$ is the number of tuples
having value $a_i$ for A, and $\mathrm{count}(B = b_j)$ is the number of tuples having
value $b_j$ for B. The sum in Eq. (3.1) is computed over all of the r × c cells. Note
that the cells that contribute the most to the $\chi^2$ value are those for which the
actual count is very different from that expected.
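
As a minimal sketch of Eqs. (3.1) and (3.2), the following Python function computes the
expected frequencies and the chi-square value from a table of observed counts. The
function name chi_square_statistic and the small 2 × 2 table are illustrative
assumptions, not data from the text. The table is stored as a list of rows (values of
B), each listing the counts for the values of A, and every row and column total is
assumed to be positive so that no expected frequency is zero.

```python
def chi_square_statistic(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]        # count(B = b_j) for each row
    col_totals = [sum(col) for col in zip(*observed)]  # count(A = a_i) for each column
    n = sum(row_totals)                                # number of data tuples

    # Eq. (3.2): expected count = column total * row total / n
    expected = [[ct * rt / n for ct in col_totals] for rt in row_totals]

    # Eq. (3.1): sum of (o - e)^2 / e over all r x c cells
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected


# Hypothetical observed counts (not from the text): 2 rows x 2 columns.
example_observed = [[10, 20], [30, 40]]
example_chi2, example_expected = chi_square_statistic(example_observed)
print(example_expected)
print(round(example_chi2, 4))
```

Because each term is divided by its expected frequency, a cell with a small expected
count and a large deviation dominates the sum, which is the point made in the note above.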
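The $\chi^2$ statistic tests the hypothesis that A and B are independent, that is, there
is no correlation between them. The test is carried out at a chosen significance level,
with (r − 1) × (c − 1) degrees of freedom: the hypothesis is rejected when the computed
$\chi^2$ value exceeds the critical value of the $\chi^2$ distribution for that
significance level and number of degrees of freedom. If the hypothesis can be rejected,
then we say that A and B are statistically correlated. We illustrate the use of this
statistic in Example 3.1.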
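
The decision step can be sketched in Python using only the standard library. The code
below is one possible way to obtain the upper-tail probability of the chi-square
distribution (closed forms for 1 and 2 degrees of freedom, extended by the standard
recurrence for the regularized incomplete gamma function); it is not a method given in
the text. The function names and the default significance level of 0.05 are assumptions
made for illustration.

```python
import math


def chi_square_p_value(chi2, dof):
    """Upper-tail probability P(X >= chi2) for a chi-square variable with dof degrees of freedom."""
    if dof < 1:
        raise ValueError("dof must be a positive integer")
    half = chi2 / 2.0
    if dof % 2 == 1:
        k = 1
        q = math.erfc(math.sqrt(half))  # closed form for 1 degree of freedom
    else:
        k = 2
        q = math.exp(-half)             # closed form for 2 degrees of freedom
    # Recurrence: Q(k + 2) = Q(k) + (chi2/2)^(k/2) * exp(-chi2/2) / Gamma(k/2 + 1)
    while k < dof:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def is_correlated(chi2, r, c, significance=0.05):
    """Reject independence when the p-value falls below the significance level."""
    dof = (r - 1) * (c - 1)  # degrees of freedom for an r x c contingency table
    return chi_square_p_value(chi2, dof) < significance


# A 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
print(is_correlated(10.0, r=2, c=2))
print(is_correlated(1.0, r=2, c=2))
```

Comparing the p-value with the significance level is equivalent to comparing the
statistic with the tabulated critical value, so either form of the test can be used.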

Example 3.1 Correlation analysis of nominal attributes using $\chi^2$. Suppose that a group of 1500
people was surveyed. The gender of each person was noted. Each person was polled as
to whether his or her preferred type of reading material was fiction or nonfiction. Thus,
we have two attributes, gender and preferred reading. The observed frequency (or count)
of each possible joint event is summarized in the contingency table shown in Table 3.1,
where the numbers in parentheses are the expected frequencies. The expected frequencies
are calculated based on the data distribution for both attributes using Eq. (3.2).
Using Eq. (3.2), we can verify the expected frequencies for each cell. For example,
the expected frequency for the cell (male, fiction) is

\[
e_{11} = \frac{\mathrm{count}(\text{male}) \times \mathrm{count}(\text{fiction})}{n} = \frac{300 \times 450}{1500} = 90,
\]
and so on. Notice that in any row, the sum of the expected frequencies must equal the
total observed frequency for that row, and the sum of the expected frequencies in any
column must also equal the total observed frequency for that column.
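
The row and column property can be checked numerically for this example. It holds in
general because summing Eq. (3.2) over all values of one attribute gives n in the
numerator, which cancels the denominator. The sketch below assumes that each attribute
has exactly two values, so the remaining totals follow from the quoted ones
(1500 − 300 = 1200 and 1500 − 450 = 1050); the label "female" and the variable names are
illustrative assumptions.

```python
import math

n = 1500
gender_totals = {"male": 300, "female": n - 300}          # column totals
reading_totals = {"fiction": 450, "nonfiction": n - 450}  # row totals

# Eq. (3.2) applied to every (gender, reading) cell
expected = {
    (g, m): g_total * m_total / n
    for g, g_total in gender_totals.items()
    for m, m_total in reading_totals.items()
}

# Expected frequencies in each row sum to that row's observed total
for m, m_total in reading_totals.items():
    assert math.isclose(sum(expected[(g, m)] for g in gender_totals), m_total)

# Expected frequencies in each column sum to that column's observed total
for g, g_total in gender_totals.items():
    assert math.isclose(sum(expected[(g, m)] for m in reading_totals), g_total)

print(expected[("male", "fiction")])  # 90.0, matching the worked value above
```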
