
                           “So, how can we proceed with discrepancy detection?” As a starting point, use any
                         knowledge you may already have regarding properties of the data. Such knowledge or
                          “data about data” is referred to as metadata. This is where we can make use of the
                          knowledge we gained about our data in Chapter 2. For example, what are the data type and
                         domain of each attribute? What are the acceptable values for each attribute? The basic
                         statistical data descriptions discussed in Section 2.2 are useful here to grasp data trends
                         and identify anomalies. For example, find the mean, median, and mode values. Are the
                         data symmetric or skewed? What is the range of values? Do all values fall within the
                         expected range? What is the standard deviation of each attribute? Values that are more
                         than two standard deviations away from the mean for a given attribute may be flagged
                         as potential outliers. Are there any known dependencies between attributes? In this step,
                          you may write your own scripts and/or use some of the tools that we discuss later in this section.
                         From this, you may find noise, outliers, and unusual values that need investigation.
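                            As a simple illustration of this step, the short Python sketch below summarizes one
                          numeric attribute and flags values more than two standard deviations from the mean as
                          potential outliers. It uses only the standard statistics module; the attribute name and
                          sample values are hypothetical.

```python
import statistics

def describe_and_flag(values, k=2.0):
    """Summarize one numeric attribute and flag values more than
    k standard deviations from the mean as potential outliers."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    summary = {
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        "range": (min(values), max(values)),
    }
    summary["potential_outliers"] = [v for v in values if abs(v - mean) > k * stdev]
    return summary

# Hypothetical attribute values, e.g., customer ages from a sales table.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 118]
print(describe_and_flag(ages))
```

                          In practice, such a summary would be computed per attribute, and the flagged values
                          would be fed into the investigation described above.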
                           As a data analyst, you should be on the lookout for the inconsistent use of codes and
                         any inconsistent data representations (e.g., “2010/12/25” and “25/12/2010” for date).
                         Field overloading is another error source that typically results when developers squeeze
                         new attribute definitions into unused (bit) portions of already defined attributes (e.g.,
                         an unused bit of an attribute that has a value range that uses only, say, 31 out of
                         32 bits).
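                            As a rough illustration of catching inconsistent representations, the sketch below
                          checks whether each date string parses under a single expected layout; the accepted
                          formats and sample values are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of accepted date layouts; the first one is the "house" format.
FORMATS = ["%Y/%m/%d", "%d/%m/%Y"]

def classify_date(text):
    """Return the first format that parses `text`, or None if none do."""
    for fmt in FORMATS:
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue
    return None

dates = ["2010/12/25", "25/12/2010", "2010-12-25"]
for d in dates:
    fmt = classify_date(d)
    if fmt != FORMATS[0]:
        print(f"inconsistent or unparseable date representation: {d!r} ({fmt})")
```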
                           The data should also be examined regarding unique rules, consecutive rules, and null
                         rules. A unique rule says that each value of the given attribute must be different from
                          all other values for that attribute. A consecutive rule says that there can be no missing
                          values between the lowest and highest values for the attribute, and that all values
                         must also be unique (e.g., as in check numbers). A null rule specifies the use of blanks,
                         question marks, special characters, or other strings that may indicate the null condition
                         (e.g., where a value for a given attribute is not available), and how such values should
                         be handled. As mentioned in Section 3.2.1, reasons for missing values may include
                         (1) the person originally asked to provide a value for the attribute refuses and/or finds
                         that the information requested is not applicable (e.g., a license number attribute left
                         blank by nondrivers); (2) the data entry person does not know the correct value; or (3)
                         the value is to be provided by a later step of the process. The null rule should specify how
                          to record the null condition, for example, by storing zero for numeric attributes, a
                         blank for character attributes, or any other conventions that may be in use (e.g., entries
                         like “don’t know” or “?” should be transformed to blank).
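                            The sketch below illustrates how unique, consecutive, and null rules might be checked
                          for a single attribute such as a check-number column; the null markers and sample
                          values are hypothetical.

```python
# Hypothetical tokens treated as null markers under the null rule.
NULL_MARKERS = {"", "?", "don't know", None}

def check_rules(values):
    """Check unique, consecutive, and null rules for one attribute."""
    nulls = [v for v in values if v in NULL_MARKERS]
    non_null = [v for v in values if v not in NULL_MARKERS]

    # Unique rule: no value may occur more than once.
    duplicates = {v for v in non_null if non_null.count(v) > 1}
    # Consecutive rule: no missing values between the lowest and highest values.
    expected = set(range(min(non_null), max(non_null) + 1))
    gaps = sorted(expected - set(non_null))

    return {"nulls": nulls, "duplicates": sorted(duplicates), "gaps": gaps}

# Hypothetical check-number column with a duplicate, a gap, and a null marker.
print(check_rules([101, 102, 102, 104, "?", 105]))
```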
                           There are a number of different commercial tools that can aid in the discrepancy
                         detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge
                         of postal addresses and spell-checking) to detect errors and make corrections in the
                         data. These tools rely on parsing and fuzzy matching techniques when cleaning data
                         from multiple sources. Data auditing tools find discrepancies by analyzing the data to
                         discover rules and relationships, and detecting data that violate such conditions. They
                         are variants of data mining tools. For example, they may employ statistical analysis to
                         find correlations, or clustering to identify outliers. They may also use the basic statistical
                         data descriptions presented in Section 2.2.
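                            As a rough sketch of the kind of fuzzy matching a data scrubbing tool relies on, the
                          example below uses Python's standard difflib module to match dirty values against a
                          small reference list of valid spellings; the reference list and similarity cutoff are
                          illustrative, and real tools apply far richer domain knowledge.

```python
import difflib

# Hypothetical reference list of valid city spellings (domain knowledge).
VALID_CITIES = ["Chicago", "New York", "Vancouver", "San Francisco"]

def suggest_correction(value, cutoff=0.7):
    """Fuzzy-match a dirty value against the reference list and return
    the canonical spelling, or None if nothing is close enough."""
    lowered = {c.lower(): c for c in VALID_CITIES}
    matches = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

for city in ["Chcago", "new york", "Vancover"]:
    print(city, "->", suggest_correction(city))
```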
                           Some data inconsistencies may be corrected manually using external references.
                          For example, errors made at data entry may be corrected by performing a paper trace.