
          84    Chapter 3 Data Preprocessing                 2011/6/1  3:16 Page 84  #2


                 3.1     Data Preprocessing: An Overview


                         This section presents an overview of data preprocessing. Section 3.1.1 illustrates the
                         many elements defining data quality. This provides the incentive behind data prepro-
                         cessing. Section 3.1.2 outlines the major tasks in data preprocessing.



                   3.1.1 Data Quality: Why Preprocess the Data?

                         Data have quality if they satisfy the requirements of the intended use. There are many
                         factors comprising data quality, including accuracy, completeness, consistency, timeliness,
                         believability, and interpretability.
                           Imagine that you are a manager at AllElectronics and have been charged with ana-
                         lyzing the company’s data with respect to your branch’s sales. You immediately set out
                         to perform this task. You carefully inspect the company’s database and data warehouse,
                         identifying and selecting the attributes or dimensions (e.g., item, price, and units sold)
                         to be included in your analysis. Alas! You notice that several of the attributes for various
                         tuples have no recorded value. For your analysis, you would like to include informa-
                         tion as to whether each item purchased was advertised as on sale, yet you discover that
                         this information has not been recorded. Furthermore, users of your database system
                         have reported errors, unusual values, and inconsistencies in the data recorded for some
                         transactions. In other words, the data you wish to analyze by data mining techniques are
                         incomplete (lacking attribute values or certain attributes of interest, or containing only
                         aggregate data); inaccurate or noisy (containing errors, or values that deviate from the
                         expected); and inconsistent (e.g., containing discrepancies in the department codes used
                         to categorize items). Welcome to the real world!
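The three problems in this scenario can each be checked mechanically before any analysis begins. The following is a minimal sketch, not a definitive implementation. The sample tuples, the attribute name `dept`, and the helper functions are all hypothetical and chosen only for illustration. A negative price stands in here for "values that deviate from the expected".

```python
from collections import Counter

# Hypothetical sales tuples; None marks an attribute with no recorded value.
sales = [
    {"item": "TV", "price": 499.0, "units_sold": 3, "dept": "ELEC"},
    {"item": "Radio", "price": None, "units_sold": 5, "dept": "elec"},
    {"item": "Camera", "price": 299.0, "units_sold": None, "dept": "Electronics"},
    {"item": "Phone", "price": -20.0, "units_sold": 2, "dept": "ELEC"},
]

def missing_counts(rows):
    """Incomplete data: count tuples lacking a value, per attribute."""
    counts = Counter()
    for row in rows:
        for attr, value in row.items():
            if value is None:
                counts[attr] += 1
    return dict(counts)

def negative_prices(rows):
    """Noisy data: items whose recorded price cannot be correct."""
    return [row["item"] for row in rows
            if row["price"] is not None and row["price"] < 0]

def distinct_codes(rows, attr):
    """Inconsistent data: list the distinct codes used for one attribute."""
    return sorted({row[attr] for row in rows if row[attr] is not None})

missing = missing_counts(sales)       # attributes with unrecorded values
noisy = negative_prices(sales)        # items with an impossible price
codes = distinct_codes(sales, "dept") # several spellings of one department
```

Three distinct codes for what is presumably a single department signal an inconsistency that has to be resolved before items can be grouped by department.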
                           This scenario illustrates three of the elements defining data quality: accuracy, com-
                         pleteness, and consistency. Inaccurate, incomplete, and inconsistent data are common-
                         place properties of large real-world databases and data warehouses. There are many
                         possible reasons for inaccurate data (i.e., having incorrect attribute values). The data col-
                         lection instruments used may be faulty. There may have been human or computer errors
                         occurring at data entry. Users may purposely submit incorrect data values for manda-
                         tory fields when they do not wish to submit personal information (e.g., by choosing
                         the default value “January 1” displayed for birthday). This is known as disguised missing
                         data. Errors in data transmission can also occur. There may be technology limitations
                         such as limited buffer size for coordinating synchronized data transfer and consump-
                         tion. Incorrect data may also result from inconsistencies in naming conventions or data
                         codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require
                         data cleaning.
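Two of the causes above, disguised missing data and duplicate tuples, lend themselves to simple checks. The sketch below is illustrative only: the customer records, the choice of January 1 as the default birthday, and the function names are assumptions, and an unusually high share of default birthdays is merely a hint that a field is being bypassed, not proof.

```python
from collections import Counter
from datetime import date

# Hypothetical (customer_id, birthday) tuples.
customers = [
    ("C1", date(1980, 1, 1)),
    ("C2", date(1975, 1, 1)),
    ("C3", date(1990, 6, 14)),
    ("C4", date(1988, 1, 1)),
    ("C3", date(1990, 6, 14)),  # exact duplicate of an earlier tuple
    ("C5", date(1969, 11, 2)),
]

def default_birthday_share(rows, month=1, day=1):
    """Fraction of tuples whose birthday falls on the form's default date."""
    hits = sum(1 for _, b in rows if (b.month, b.day) == (month, day))
    return hits / len(rows)

def duplicate_tuples(rows):
    """Tuples that occur more than once, each reported a single time."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]

share = default_birthday_share(customers)
dups = duplicate_tuples(customers)
```

If birthdays were spread evenly over the year, roughly 1 in 365 would fall on any given day, so a share far above that for the default date suggests disguised missing data.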
                           Incomplete data can occur for a number of reasons. Attributes of interest may not
                         always be available, such as customer information for sales transaction data. Other data
                         may not be included simply because they were not considered important at the time
                         of entry. Relevant data may not be recorded due to a misunderstanding or because of
                         equipment malfunctions. Data that were inconsistent with other recorded data may