


          94    Chapter 3 Data Preprocessing                 2011/6/1  3:16  Page 94  #12



                         in the resulting data set. This can help improve the accuracy and speed of the subsequent
                         data mining process.
                           The semantic heterogeneity and structure of data pose great challenges in data inte-
                         gration. How can we match schema and objects from different sources? This is the
                         essence of the entity identification problem, described in Section 3.3.1. Are any attributes
                         correlated? Section 3.3.2 presents correlation tests for numeric and nominal data. Tuple
                         duplication is described in Section 3.3.3. Finally, Section 3.3.4 touches on the detection
                         and resolution of data value conflicts.


                   3.3.1 Entity Identification Problem
                         It is likely that your data analysis task will involve data integration, which combines data
                         from multiple sources into a coherent data store, as in data warehousing. These sources
                         may include multiple databases, data cubes, or flat files.
                           There are a number of issues to consider during data integration. Schema integration
                         and object matching can be tricky. How can equivalent real-world entities from multiple
                         data sources be matched up? This is referred to as the entity identification problem.
                         For example, how can the data analyst or the computer be sure that customer_id in one
                         database and cust_number in another refer to the same attribute? Examples of metadata
                         for each attribute include the name, meaning, data type, and range of values permitted
                         for the attribute, and null rules for handling blank, zero, or null values (Section 3.2).
                         Such metadata can be used to help avoid errors in schema integration. The metadata
                         may also be used to help transform the data (e.g., where data codes for pay_type in one
                         database may be “H” and “S” but 1 and 2 in another). Hence, this step also relates to
                         data cleaning, as described earlier.
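As a minimal sketch of how attribute metadata might support schema matching, the Python fragment below compares two metadata records and recodes pay type values. The `AttributeMeta` class, the `likely_same_attribute` heuristic, and the assignment of "H" to 1 and "S" to 2 are all assumptions made for illustration; the text does not say which code corresponds to which.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeMeta:
    """Hypothetical metadata record for one attribute."""
    name: str
    meaning: str
    data_type: str


def likely_same_attribute(a: AttributeMeta, b: AttributeMeta) -> bool:
    """Heuristic only: names may differ, but meaning and data type must agree."""
    return a.meaning == b.meaning and a.data_type == b.data_type


# Assumed correspondence between the two coding schemes (not given in the text).
PAY_TYPE_MAP = {"H": 1, "S": 2}


def recode_pay_type(code: str) -> int:
    """Translate a letter code into the numeric code of the other database."""
    if code not in PAY_TYPE_MAP:
        raise ValueError(f"unknown pay type code: {code!r}")
    return PAY_TYPE_MAP[code]


customer_id = AttributeMeta("customer_id", "unique customer key", "integer")
cust_number = AttributeMeta("cust_number", "unique customer key", "integer")
cust_name = AttributeMeta("cust_name", "customer full name", "string")

print(likely_same_attribute(customer_id, cust_number))
print(likely_same_attribute(customer_id, cust_name))
print([recode_pay_type(c) for c in ["H", "S", "H"]])
```

A match under this heuristic is only a candidate; an analyst would still need to confirm it, for example by comparing the permitted value ranges and null rules.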
                           When matching attributes from one database to another during integration, special
                         attention must be paid to the structure of the data. This is to ensure that any attribute
                         functional dependencies and referential constraints in the source system match those in
                         the target system. For example, in one system, a discount may be applied to the order,
                         whereas in another system it is applied to each individual line item within the order.
                         If this is not caught before integration, items in the target system may be improperly
                         discounted.
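The discount example can be made concrete with a small Python sketch. The function names, the prices in cents, and the 10 percent rate are hypothetical; the point is only that an order-level discount copied naively into a line-item field changes the total.

```python
def total_order_level(prices_cents, discount_pct):
    """Source system: one discount applied to the whole order subtotal."""
    return sum(prices_cents) * (100 - discount_pct) // 100


def total_line_level(prices_cents, discounts_pct):
    """Target system: a separate discount applied to each line item."""
    return sum(p * (100 - d) // 100 for p, d in zip(prices_cents, discounts_pct))


prices = [10000, 5000]  # two line items, in cents

source_total = total_order_level(prices, 10)

# Naive integration: the order discount lands on the first line item only.
naive_total = total_line_level(prices, [10, 0])

# Structure-aware integration: the discount is propagated to every line item.
aware_total = total_line_level(prices, [10, 10])

print(source_total, naive_total, aware_total)
```

Here the naive mapping charges more than the source system did, while propagating the rate to every line item reproduces the source total.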


                   3.3.2 Redundancy and Correlation Analysis
                         Redundancy is another important issue in data integration. An attribute (such as annual
                         revenue, for instance) may be redundant if it can be “derived” from another attribute
                         or set of attributes. Inconsistencies in attribute or dimension naming can also cause
                         redundancies in the resulting data set.
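One simple way to check whether an attribute is derivable from others is to test a candidate formula against every tuple. The sketch below assumes, purely for illustration, that annual revenue is the sum of four quarterly revenue attributes; the attribute names `q1` to `q4` and the helper `is_sum_of` are hypothetical.

```python
def is_sum_of(rows, target, sources, tol=1e-9):
    """Return True if rows[target] equals the sum of the source attributes in every row."""
    return all(
        abs(row[target] - sum(row[s] for s in sources)) <= tol
        for row in rows
    )


rows = [
    {"q1": 10, "q2": 12, "q3": 9, "q4": 14, "annual_revenue": 45},
    {"q1": 20, "q2": 18, "q3": 22, "q4": 25, "annual_revenue": 85},
]

quarters = ["q1", "q2", "q3", "q4"]
print(is_sum_of(rows, "annual_revenue", quarters))
```

If the check holds for all tuples, the target attribute carries no information beyond its sources and is a candidate for removal.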
                           Some redundancies can be detected by correlation analysis. Given two attributes,
                         such analysis can measure how strongly one attribute implies the other, based on the
                         available data. For nominal data, we use the χ² (chi-square) test. For numeric attributes,
                         we can use the correlation coefficient and covariance, both of which assess how one
                         attribute’s values vary from those of another.
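The three measures named above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not a definitive one: covariance and the correlation coefficient use the population (divide by n) form, and the χ² function returns only the Pearson statistic, without the degrees of freedom or significance lookup a full test requires. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Pearson correlation coefficient: covariance over the product of standard deviations."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


def chi_square(pairs):
    """Pearson chi-square statistic for a list of (a, b) nominal value pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    a_counts = Counter(a for a, _ in pairs)
    b_counts = Counter(b for _, b in pairs)
    stat = 0.0
    for a, count_a in a_counts.items():
        for b, count_b in b_counts.items():
            expected = count_a * count_b / n  # count expected under independence
            observed = joint[(a, b)]          # Counter returns 0 for unseen pairs
            stat += (observed - expected) ** 2 / expected
    return stat


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]  # ys is exactly 2 * xs, so the two attributes are redundant
print(covariance(xs, ys), correlation(xs, ys))

independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
dependent = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(chi_square(independent), chi_square(dependent))
```

A correlation coefficient near +1 or -1, or a large χ² statistic, suggests that one attribute may be redundant given the other; values near zero suggest no linear relationship or no association in the available data.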