

                               trace. Most errors, however, will require data transformations. That is, once we find
                               discrepancies, we typically need to define and apply (a series of) transformations to
                               correct them.
                                 Commercial tools can assist in the data transformation step. Data migration tools
                               allow simple transformations to be specified such as to replace the string “gender” by
                               “sex.” ETL (extraction/transformation/loading) tools allow users to specify transforms
                               through a graphical user interface (GUI). These tools typically support only a restricted
                               set of transforms so that, often, we may also choose to write custom scripts for this step
                               of the data cleaning process.
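Such a custom script might look like the following minimal sketch (the column names and data are hypothetical): it performs the kind of simple transformation a data migration tool would let us specify, renaming the "gender" field to "sex" and trimming stray whitespace.

```python
import csv
import io

def migrate(rows):
    """Rename the 'gender' column to 'sex' and strip stray whitespace."""
    out = []
    for row in rows:
        clean = {("sex" if k == "gender" else k): v.strip()
                 for k, v in row.items()}
        out.append(clean)
    return out

# Hypothetical sample data standing in for a real data store.
src = io.StringIO("name,gender\nAlice, F\nBob,M\n")
rows = migrate(list(csv.DictReader(src)))
```

A real ETL tool would express the same rename through its GUI; the script form becomes necessary once the needed transform falls outside the tool's restricted set.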
                               The two-step process of discrepancy detection and data transformation (to correct
                               discrepancies) iterates. This process, however, is error-prone and time-consuming. Some
                               transformations may introduce more discrepancies. Some nested discrepancies may only
                               be detected after others have been fixed. For example, a typo such as “20010” in a year
                               field may only surface once all date values have been converted to a uniform format.
                               Transformations are often done as a batch process while the user waits without feedback.
                               Only after the transformation is complete can the user go back and check that no new
                               anomalies have been mistakenly created. Typically, numerous iterations are required
                               before the user is satisfied. Any tuples that cannot be automatically handled by a given
                               transformation are typically written to a file without any explanation of why they
                               failed. As a result, the entire data cleaning process also suffers from a lack of
                               interactivity.
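The nested-discrepancy example above can be sketched as follows (the date formats and values are illustrative assumptions): we first normalize mixed date formats, and only then does a year-range check surface the "20010" typo, because before normalization the malformed value cannot be distinguished reliably from a format variant.

```python
from datetime import datetime

def normalize(date_str):
    """Try a few common formats; leave unparseable values for a later pass."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return date_str

# Hypothetical field values in mixed formats, including the "20010" typo.
dates = ["03/04/2010", "2010-06-01", "20010-01-15"]
uniform = [normalize(d) for d in dates]

# This range check is only meaningful once values share one format.
suspect = [d for d in uniform if not 1900 <= int(d.split("-")[0]) <= 2100]
```

This is one iteration of the detect-transform loop; in practice the `suspect` list would feed the next round of corrections.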
                                 New approaches to data cleaning emphasize increased interactivity. Potter’s Wheel,
                               for example, is a publicly available data cleaning tool that integrates discrepancy detec-
                               tion and transformation. Users gradually build a series of transformations by composing
                               and debugging individual transformations, one step at a time, on a spreadsheet-like
                               interface. The transformations can be specified graphically or by providing examples.
                               Results are shown immediately on the records that are visible on the screen. The user
                               can choose to undo the transformations, so that transformations that introduced addi-
                               tional errors can be “erased.” The tool automatically performs discrepancy checking in
                               the background on the latest transformed view of the data. Users can gradually develop
                               and refine transformations as discrepancies are found, leading to more effective and
                               efficient data cleaning.
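The interaction style just described can be caricatured in a few lines (this is a toy sketch, not Potter's Wheel's actual implementation): transformations are composed one step at a time on a working view, and any step can be undone because each intermediate view is retained.

```python
class TransformStack:
    """Spreadsheet-like view with undoable, composable transformations."""

    def __init__(self, data):
        self.views = [list(data)]  # keep every view so steps can be undone

    def apply(self, transform):
        # Transform the latest view and push the result as a new view.
        self.views.append([transform(r) for r in self.views[-1]])

    def undo(self):
        # "Erase" the most recent transformation by dropping its view.
        if len(self.views) > 1:
            self.views.pop()

    @property
    def current(self):
        return self.views[-1]

stack = TransformStack(["  alice", "BOB "])
stack.apply(str.strip)
stack.apply(str.title)
stack.undo()  # the title-casing step introduced no errors here, but if it
              # had, undoing restores the previous view immediately
```

A real tool would additionally run discrepancy checks in the background on `stack.current` after each step.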
                                 Another approach to increased interactivity in data cleaning is the development of
                               declarative languages for the specification of data transformation operators. Such work
                               focuses on defining powerful extensions to SQL and algorithms that enable users to
                               express data cleaning specifications efficiently.
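A flavor of the declarative style can be shown with plain SQL (the table and columns are hypothetical, and this uses only standard SQLite functions, not the research extensions the text refers to): a single statement standardizes an entire column, rather than a row-by-row procedural script.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (name TEXT, state TEXT)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(" Alice ", "ca"), ("Bob", "CA")])

# One declarative statement cleans the whole table at once.
con.execute("UPDATE customer SET name = TRIM(name), state = UPPER(state)")
rows = con.execute(
    "SELECT name, state FROM customer ORDER BY name").fetchall()
```

The declarative form also lets the system optimize execution of the cleaning specification, which is part of the efficiency argument for this line of work.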
                                 As we discover more about the data, it is important to keep updating the metadata
                               to reflect this knowledge. This will help speed up data cleaning on future versions of the
                               same data store.

                       3.3     Data Integration


                               Data mining often requires data integration—the merging of data from multiple data
                               stores. Careful integration can help reduce and avoid redundancies and inconsistencies