Chapter 3 Data Preprocessing

                 3.2     Data Cleaning


                         Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
                         cleansing) routines attempt to fill in missing values, smooth out noise while identi-
                         fying outliers, and correct inconsistencies in the data. In this section, you will study
                         basic methods for data cleaning. Section 3.2.1 looks at ways of handling missing values.
                         Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to
                         data cleaning as a process.

                   3.2.1 Missing Values

                         Imagine that you need to analyze AllElectronics sales and customer data. You note that
                         many tuples have no recorded value for several attributes such as customer income. How
                         can you go about filling in the missing values for this attribute? Let’s look at the following
                         methods.

                         1. Ignore the tuple: This is usually done when the class label is missing (assuming the
                           mining task involves classification). This method is not very effective, unless the tuple
                           contains several attributes with missing values. It is especially poor when the percent-
                           age of missing values per attribute varies considerably. By ignoring the tuple, we do
                           not make use of the remaining attributes’ values in the tuple. Such data could have
                           been useful to the task at hand.
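As a sketch of this first strategy, tuples whose class label is missing can simply be filtered out before mining; the attribute names (income, risk) and the tiny data set below are illustrative, not from the text:

```python
# Sketch: discard tuples whose class label ("risk") is missing.
# Records and attribute names are illustrative only.
records = [
    {"income": 56000, "risk": "low"},
    {"income": 43000, "risk": None},   # class label missing -> dropped
    {"income": None,  "risk": "high"},
]

kept = [r for r in records if r["risk"] is not None]
# Only tuples with a known class label survive; note that the dropped
# tuple's income value (43000) is lost to the task at hand.
```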

                         2. Fill in the missing value manually: In general, this approach is time consuming and
                           may not be feasible given a large data set with many missing values.
                         3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like “Unknown” or −∞. If missing values are
                           replaced by, say, “Unknown,” then the mining program may mistakenly think that
                           they form an interesting concept, since they all have a value in common—that of
                           “Unknown.” Hence, although this method is simple, it is not foolproof.
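A minimal sketch of the global-constant strategy, using the illustrative label “Unknown” and hypothetical income values:

```python
# Sketch: replace every missing income with the global constant "Unknown".
incomes = [56000, None, 43000, None]   # illustrative values
filled = ["Unknown" if v is None else v for v in incomes]
# All formerly missing entries now share the value "Unknown", which a
# mining program could mistake for an interesting concept of its own.
```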
                         4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
                           fill in the missing value: Chapter 2 discussed measures of central tendency, which
                           indicate the “middle” value of a data distribution. For normal (symmetric) data dis-
tributions, the mean can be used, while skewed data distributions should employ
                           the median (Section 2.2). For example, suppose that the data distribution regard-
                           ing the income of AllElectronics customers is symmetric and that the mean income is
                           $56,000. Use this value to replace the missing value for income.
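The central-tendency strategy can be sketched with Python's statistics module; the income figures below are illustrative and chosen so the mean matches the $56,000 of the example:

```python
import statistics

incomes = [52000, 56000, 60000, None, 56000]   # illustrative values
observed = [v for v in incomes if v is not None]

# For a roughly symmetric distribution, fill with the mean; for a
# skewed one, the median is the safer choice (Section 2.2).
fill_value = statistics.mean(observed)         # 56000 for these values
filled = [fill_value if v is None else v for v in incomes]
```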
                         5. Use the attribute mean or median for all samples belonging to the same class as
                           the given tuple: For example, if classifying customers according to credit risk, we
                           may replace the missing value with the mean income value for customers in the same
                           credit risk category as that of the given tuple. If the data distribution for a given class
                           is skewed, the median value is a better choice.
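The class-conditional variant can be sketched as follows: a missing income is filled with the median income of tuples in the same (hypothetical) credit-risk class, rather than with one global value:

```python
import statistics
from collections import defaultdict

# Illustrative tuples of (income, credit_risk); income may be missing.
data = [
    (40000, "high"), (44000, "high"), (None, "high"),
    (70000, "low"),  (90000, "low"),  (None, "low"),
]

# Collect the observed incomes per credit-risk class.
by_class = defaultdict(list)
for income, risk in data:
    if income is not None:
        by_class[risk].append(income)

# Per-class median; with skewed incomes the median beats the mean.
class_median = {risk: statistics.median(vals)
                for risk, vals in by_class.items()}

filled = [(income if income is not None else class_median[risk], risk)
          for income, risk in data]
```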
                         6. Use the most probable value to fill in the missing value: This may be determined
                           with regression, inference-based tools using a Bayesian formalism, or decision tree