“So, how can we proceed with discrepancy detection?” As a starting point, use any
knowledge you may already have regarding properties of the data. Such knowledge or
“data about data” is referred to as metadata. This is where we can make use of the know-
ledge we gained about our data in Chapter 2. For example, what are the data type and
domain of each attribute? What are the acceptable values for each attribute? The basic
statistical data descriptions discussed in Section 2.2 are useful here to grasp data trends
and identify anomalies. For example, find the mean, median, and mode values. Are the
data symmetric or skewed? What is the range of values? Do all values fall within the
expected range? What is the standard deviation of each attribute? Values that are more
than two standard deviations away from the mean for a given attribute may be flagged
as potential outliers. Are there any known dependencies between attributes? In this step,
you may write your own scripts and/or use some of the tools that we discuss later in this section.
From this, you may find noise, outliers, and unusual values that need investigation.
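For instance, a few lines of script are often enough to compute these descriptions and apply the two-standard-deviation check. The sketch below is a minimal example in Python; the attribute name unit_price and its values are hypothetical.

import statistics

# Hypothetical values for one attribute of interest.
unit_price = [2.5, 3.0, 2.8, 3.1, 2.9, 45.0, 3.2, 2.7]

mean = statistics.mean(unit_price)
stdev = statistics.stdev(unit_price)

print(f"mean={mean:.2f}, median={statistics.median(unit_price):.2f}, "
      f"stdev={stdev:.2f}")

# Flag values more than two standard deviations from the mean
# as potential outliers for closer inspection.
outliers = [v for v in unit_price if abs(v - mean) > 2 * stdev]
print("potential outliers:", outliers)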
As a data analyst, you should be on the lookout for the inconsistent use of codes and
any inconsistent data representations (e.g., “2010/12/25” and “25/12/2010” for date).
Field overloading is another error source that typically results when developers squeeze
new attribute definitions into unused (bit) portions of already defined attributes (e.g.,
an unused bit of an attribute that has a value range that uses only, say, 31 out of
32 bits).
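A small script can also help to surface inconsistent representations. The sketch below (Python) attempts to parse each date string against a short list of expected formats and flags anything that matches none of them; the formats and sample strings are illustrative assumptions.

from datetime import datetime

# Formats we expect to encounter; anything else is flagged for review.
EXPECTED_FORMATS = ("%Y/%m/%d", "%d/%m/%Y")

def normalize_date(s):
    """Return an ISO date string, or None if s matches no expected format."""
    for fmt in EXPECTED_FORMATS:
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # candidate discrepancy

for raw in ["2010/12/25", "25/12/2010", "Dec 25, 2010"]:
    print(raw, "->", normalize_date(raw) or "UNRECOGNIZED")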
The data should also be examined regarding unique rules, consecutive rules, and null
rules. A unique rule says that each value of the given attribute must be different from
all other values for that attribute. A consecutive rule says that there can be no miss-
ing values between the lowest and highest values for the attribute, and that all values
must also be unique (e.g., as in check numbers). A null rule specifies the use of blanks,
question marks, special characters, or other strings that may indicate the null condition
(e.g., where a value for a given attribute is not available), and how such values should
be handled. As mentioned in Section 3.2.1, reasons for missing values may include
(1) the person originally asked to provide a value for the attribute refuses and/or finds
that the information requested is not applicable (e.g., a license number attribute left
blank by nondrivers); (2) the data entry person does not know the correct value; or (3)
the value is to be provided by a later step of the process. The null rule should specify how
to record the null condition, for example, to store zero for numeric attributes, a
blank for character attributes, or any other conventions that may be in use (e.g., entries
like “don’t know” or “?” should be transformed to blank).
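Checks of this kind are straightforward to script. The sketch below (Python) applies a unique rule, a consecutive rule, and a simple null rule to one attribute; the check-number values and the set of null markers are assumptions for illustration.

# Hypothetical check-number entries, including null markers.
check_numbers = ["101", "102", "104", "104", "?", ""]

NULL_MARKERS = {"", "?", "don't know"}

nulls = [v for v in check_numbers if v in NULL_MARKERS]
nums = sorted(int(v) for v in check_numbers if v not in NULL_MARKERS)

# Unique rule: every value must differ from all other values.
duplicates = sorted({n for n in nums if nums.count(n) > 1})

# Consecutive rule: no missing values between the lowest and highest.
missing = sorted(set(range(nums[0], nums[-1] + 1)) - set(nums))

# Null rule: count the null markers so they can be recorded uniformly.
print("duplicate values:", duplicates)
print("missing in sequence:", missing)
print("null entries:", len(nulls))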
There are a number of different commercial tools that can aid in the discrepancy
detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge
of postal addresses and spell-checking) to detect errors and make corrections in the
data. These tools rely on parsing and fuzzy matching techniques when cleaning data
from multiple sources. Data auditing tools find discrepancies by analyzing the data to
discover rules and relationships, and detecting data that violate such conditions. They
are variants of data mining tools. For example, they may employ statistical analysis to
find correlations, or clustering to identify outliers. They may also use the basic statistical
data descriptions presented in Section 2.2.
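To illustrate the flavor of such analysis, the sketch below clusters two-attribute records and reports points that fall in no cluster as candidate outliers. It assumes scikit-learn is available; the records and the DBSCAN parameters are hypothetical.

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (age, income) records.
X = np.array([
    [25, 50000], [27, 52000], [26, 51000],
    [45, 90000], [44, 88000], [46, 91000],
    [23, 400000],   # a record that fits no cluster
])

# DBSCAN labels points that belong to no cluster as -1 (noise).
labels = DBSCAN(eps=5000, min_samples=2).fit_predict(X)

for record, label in zip(X, labels):
    if label == -1:
        print("candidate outlier:", record)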
Some data inconsistencies may be corrected manually using external references.
For example, errors made at data entry may be corrected by performing a paper