

                               (c) The automatic generation of a concept hierarchy for numeric data based on the
                                  equal-frequency partitioning rule.
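
As one possible starting point for part (c), the sketch below builds a concept hierarchy by recursively applying equal-frequency partitioning. The function names, the `fanout` and `min_size` parameters, and the dictionary node layout are all assumptions made for illustration; they are not taken from the text, and other designs are equally valid.

```python
def equal_frequency_bins(values, n_bins):
    """Split the sorted values into up to n_bins runs of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        lo = i * n // n_bins
        hi = (i + 1) * n // n_bins
        if lo < hi:  # skip empty bins when n < n_bins
            bins.append(ordered[lo:hi])
    return bins


def build_hierarchy(values, fanout=2, min_size=2):
    """Return a nested dict: each node records its value range, its count,
    and the child nodes produced by equal-frequency partitioning.

    fanout and min_size are hypothetical tuning knobs: a node is split only
    if every child could hold at least min_size values.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    ordered = sorted(values)
    node = {
        "range": (ordered[0], ordered[-1]),
        "count": len(ordered),
        "children": [],
    }
    if len(ordered) >= fanout * max(min_size, 1):
        for part in equal_frequency_bins(ordered, fanout):
            node["children"].append(build_hierarchy(part, fanout, min_size))
    return node


if __name__ == "__main__":
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data only
    root = build_hierarchy(prices, fanout=3, min_size=3)
    print(root["range"], [c["range"] for c in root["children"]])
```

Each level of the resulting tree corresponds to one level of the hierarchy, with the root covering the full value range and the leaves covering the finest intervals.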
                          3.14 Robust data loading poses a challenge in database systems because the input data are
                               often dirty. In many cases, an input record may miss multiple values; some records
                               could be contaminated, with some data values out of range or of a different data type
                               than expected. Work out an automated data cleaning and loading algorithm so that the
                               erroneous data will be marked and contaminated data will not be mistakenly inserted
                               into the database during data loading.
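
One way to approach this exercise is to validate each record against per-attribute rules before loading, and to route any failing record to a rejection list together with the reasons. The sketch below assumes records arrive as dictionaries of raw strings; `FieldRule`, `check_record`, and `load` are hypothetical names, and the in-memory `accepted` list stands in for the actual database insert.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FieldRule:
    """Hypothetical per-attribute rule: expected type and optional range."""
    name: str
    cast: Callable          # e.g. int, float, str
    low: Optional[float] = None
    high: Optional[float] = None


def check_record(record, rules):
    """Return (clean_values, errors) for one raw record."""
    clean, errors = {}, []
    for rule in rules:
        raw = record.get(rule.name)
        if raw is None or str(raw).strip() == "":
            errors.append(f"{rule.name}: missing")
            continue
        try:
            value = rule.cast(raw)
        except (TypeError, ValueError):
            errors.append(f"{rule.name}: wrong type ({raw!r})")
            continue
        too_low = rule.low is not None and value < rule.low
        too_high = rule.high is not None and value > rule.high
        if too_low or too_high:
            errors.append(f"{rule.name}: out of range ({value!r})")
            continue
        clean[rule.name] = value
    return clean, errors


def load(records, rules):
    """Split records into accepted rows and rejected rows marked with errors."""
    accepted, rejected = [], []
    for lineno, record in enumerate(records, start=1):
        clean, errors = check_record(record, rules)
        if errors:
            rejected.append((lineno, record, errors))  # marked, never inserted
        else:
            accepted.append(clean)  # stand-in for the real database insert
    return accepted, rejected
```

Rejecting a whole record on any error is a conservative design choice; a fuller solution might instead load partially valid records with the bad values set to null and flagged.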




                       3.8     Bibliographic Notes


                               Data preprocessing is discussed in a number of textbooks, including English [Eng99],
                               Pyle [Pyl99], Loshin [Los01], Redman [Red01], and Dasu and Johnson [DJ03]. More
                               specific references to individual preprocessing techniques are given later.
                                 For discussion regarding data quality, see Redman [Red92]; Wang, Storey, and
                               Firth [WSF95]; Wand and Wang [WW96]; Ballou and Tayi [BT99]; and Olson [Ols03].
                               Potter’s Wheel (control.cx.berkely.edu/abc), the interactive data cleaning tool described in
                               Section 3.2.3, is presented in Raman and Hellerstein [RH01]. An example of the devel-
                               opment of declarative languages for the specification of data transformation operators is
                               given in Galhardas et al. [GFS+01]. The handling of missing attribute values is discussed
                               in Friedman [Fri77]; Breiman, Friedman, Olshen, and Stone [BFOS84]; and Quinlan
                               [Qui89]. Hua and Pei [HP07] presented a heuristic approach to cleaning disguised miss-
                               ing data, where such data are captured when users falsely select default values on forms
                               (e.g., “January 1” for birthdate) when they do not want to disclose personal information.
                                 A method for the detection of outlier or “garbage” patterns in a handwritten char-
                               acter database is given in Guyon, Matic, and Vapnik [GMV96]. Binning and data
                               normalization are treated in many texts, including Kennedy et al. [KLV+98], Weiss
                               and Indurkhya [WI98], and Pyle [Pyl99]. Systems that include attribute (or feature)
                               construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87];
                               Stagger by Schlimmer [Sch86]; FRINGE by Pagallo [Pag89]; and AQ17-DCI by Bloe-
                               dorn and Michalski [BM98]. Attribute construction is also described in Liu and Motoda
                               [LM98a, LM98b]. Dasu et al. built a BELLMAN system and proposed a set of interesting
                               methods for building a data quality browser by mining database structures [DJMS02].
                                 A good survey of data reduction techniques can be found in Barbará et al. [BDF+97].
                               For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker
                               [SS94]; Agarwal et al. [AAD+96]; Harinarayan, Rajaraman, and Ullman [HRU96]; Ross
                               and Srivastava [RS97]; and Zhao, Deshpande, and Naughton [ZDN97]. Attribute sub-
                               set selection (or feature subset selection) is described in many texts such as Neter, Kutner,
                               Nachtsheim, and Wasserman [NKNW96]; Dash and Liu [DL97]; and Liu and Motoda
                               [LM98a, LM98b]. A combination forward selection and backward elimination method