3.3 Data Integration


                               χ² Correlation Test for Nominal Data
                               For nominal data, a correlation relationship between two attributes, A and B, can be
                               discovered by a χ² (chi-square) test. Suppose A has c distinct values, namely $a_1, a_2, \ldots, a_c$.
                               B has r distinct values, namely $b_1, b_2, \ldots, b_r$. The data tuples described by A and B can be
                               shown as a contingency table, with the c values of A making up the columns and the r
                               values of B making up the rows. Let $(A_i, B_j)$ denote the joint event that attribute A takes
                               on value $a_i$ and attribute B takes on value $b_j$, that is, where $(A = a_i, B = b_j)$. Each and
                               every possible $(A_i, B_j)$ joint event has its own cell (or slot) in the table. The χ² value
                               (also known as the Pearson χ² statistic) is computed as
                               $$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \tag{3.1}$$
                               where $o_{ij}$ is the observed frequency (i.e., actual count) of the joint event $(A_i, B_j)$ and $e_{ij}$ is
                               the expected frequency of $(A_i, B_j)$, which can be computed as
                               $$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}, \tag{3.2}$$
                               where n is the number of data tuples, $\mathrm{count}(A = a_i)$ is the number of tuples having value
                               $a_i$ for A, and $\mathrm{count}(B = b_j)$ is the number of tuples having value $b_j$ for B. The sum in
                               Eq. (3.1) is computed over all of the r × c cells. Note that the cells that contribute the
                               most to the χ² value are those for which the actual count is very different from that
                               expected.
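
Eqs. (3.1) and (3.2) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `expected_frequencies` and `chi_square` and the 2 × 2 counts in `table` are hypothetical, and the sketch assumes every row and column total is nonzero so that no expected frequency is 0.

```python
def expected_frequencies(observed):
    """Eq. (3.2): e_ij = count(A = a_i) * count(B = b_j) / n.

    observed[j][i] holds the count for row value b_j and column value a_i.
    """
    row_totals = [sum(row) for row in observed]          # count(B = b_j)
    col_totals = [sum(col) for col in zip(*observed)]    # count(A = a_i)
    n = sum(row_totals)                                  # number of tuples
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square(observed):
    """Eq. (3.1): sum of (o_ij - e_ij)^2 / e_ij over all r x c cells."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 contingency table (illustrative counts only).
table = [[30, 10],
         [20, 40]]

print(expected_frequencies(table))  # [[20.0, 20.0], [30.0, 30.0]]
print(chi_square(table))            # approximately 16.67
```

Each cell here deviates from its expected count by 10, but the cells with the smaller expected count (20) contribute 5 each while those with expected count 30 contribute about 3.33 each, which shows how the division by $e_{ij}$ weights the deviations.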
                                 The χ² statistic tests the hypothesis that A and B are independent, that is, there is no
                               correlation between them. The test is based on a significance level, with (r − 1) × (c − 1)
                               degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesis
                               can be rejected, then we say that A and B are statistically correlated.
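
The decision step can be sketched as follows. The helper names are hypothetical, and two assumptions go beyond the text above: the hypothesis is rejected when the computed χ² value exceeds the critical value for the chosen significance level, and the critical values used in the usage lines (3.841 at the 0.05 level and 10.828 at the 0.001 level, both for 1 degree of freedom) are taken from standard χ² tables. The closed-form p-value shown is valid only for 1 degree of freedom; larger tables would need a χ² table or a statistics library.

```python
import math


def degrees_of_freedom(r, c):
    """Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)."""
    return (r - 1) * (c - 1)


def p_value_one_dof(chi2):
    """Upper-tail probability of a chi-square value with exactly 1 degree of freedom.

    With 1 degree of freedom the statistic is the square of a standard normal
    variable, so P(X > chi2) = erfc(sqrt(chi2 / 2)). Not valid for other cases.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


def reject_independence(chi2, critical_value):
    """Reject the independence hypothesis when chi2 exceeds the critical value."""
    return chi2 > critical_value


# A 2 x 2 table has (2 - 1) * (2 - 1) = 1 degree of freedom.
dof = degrees_of_freedom(2, 2)

# Assumed critical values from standard chi-square tables for 1 degree of freedom.
CRITICAL_0_05 = 3.841
CRITICAL_0_001 = 10.828

chi2_value = 16.67  # hypothetical statistic for illustration
print(dof, reject_independence(chi2_value, CRITICAL_0_001), p_value_one_dof(chi2_value))
```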

                  Example 3.1 Correlation analysis of nominal attributes using χ². Suppose that a group of 1500
                               people was surveyed. The gender of each person was noted. Each person was polled as
                               to whether his or her preferred type of reading material was fiction or nonfiction. Thus,
                               we have two attributes, gender and preferred reading. The observed frequency (or count)
                               of each possible joint event is summarized in the contingency table shown in Table 3.1,
                               where the numbers in parentheses are the expected frequencies. The expected frequencies
                               are calculated based on the data distribution for both attributes using Eq. (3.2).
                                 Using Eq. (3.2), we can verify the expected frequencies for each cell. For example,
                               the expected frequency for the cell (male, fiction) is

                               $$e_{11} = \frac{\mathrm{count}(\text{male}) \times \mathrm{count}(\text{fiction})}{n} = \frac{300 \times 450}{1500} = 90,$$
                               and so on. Notice that in any row, the sum of the expected frequencies must equal the
                               total observed frequency for that row, and the sum of the expected frequencies in any
                               column must also equal the total observed frequency for that column.
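
The expected frequencies and the row and column check can be reproduced from the marginal totals alone. The sketch below uses only the figures quoted in the example (300 males, 450 fiction readers, 1500 people) and assumes, for illustration, that each attribute has exactly two values, so the remaining totals follow by subtraction; the labels "female" and "non_fiction" and the variable names are hypothetical. The sums match because adding Eq. (3.2) over all values of one attribute leaves count(B = b_j) × n / n = count(B = b_j), and likewise for A.

```python
n = 1500

# Marginal totals: the first entry of each comes from the example,
# the second is derived by subtraction under the two-value assumption.
gender_totals = {"male": 300, "female": n - 300}
reading_totals = {"fiction": 450, "non_fiction": n - 450}

# Eq. (3.2) applied to every (gender, reading) cell.
expected = {
    (g, r): gender_totals[g] * reading_totals[r] / n
    for g in gender_totals
    for r in reading_totals
}

print(expected[("male", "fiction")])  # 90.0, matching e_11 above

# Expected frequencies for one reading type sum to that type's observed total.
for r, total in reading_totals.items():
    assert sum(expected[(g, r)] for g in gender_totals) == total

# Expected frequencies for one gender sum to that gender's observed total.
for g, total in gender_totals.items():
    assert sum(expected[(g, r)] for r in reading_totals) == total
```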