
(c) The automatic generation of a concept hierarchy for numeric data based on the
equal-frequency partitioning rule.
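
As a starting point for part (c), the following is a minimal Python sketch, not a reference solution. It assumes that equal-frequency partitioning means sorting the values and cutting them into groups of (nearly) equal size, and that applying this rule recursively yields the levels of the hierarchy. The function names, the `fanout` and `min_size` parameters, the tuple representation of a node, and the sample values are all illustrative assumptions.

```python
from typing import List, Tuple

# A hierarchy node is (low, high, children). Illustrative representation only.
Node = Tuple[float, float, list]


def equal_frequency_bins(values: List[float], n_bins: int) -> List[List[float]]:
    """Sort the values and cut them into n_bins groups of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        start = i * n // n_bins
        end = (i + 1) * n // n_bins
        if start < end:  # skip empty groups when n_bins exceeds n
            bins.append(ordered[start:end])
    return bins


def build_hierarchy(values: List[float], fanout: int = 3, min_size: int = 4) -> Node:
    """Recursively partition values; stop when a group has at most min_size values."""
    if not values:
        raise ValueError("values must not be empty")
    if fanout < 2 or min_size < 1:
        raise ValueError("fanout must be >= 2 and min_size must be >= 1")
    ordered = sorted(values)
    low, high = ordered[0], ordered[-1]
    if len(ordered) <= min_size:
        return (low, high, [])  # leaf interval
    children = [
        build_hierarchy(group, fanout, min_size)
        for group in equal_frequency_bins(ordered, fanout)
    ]
    return (low, high, children)


# Hypothetical sample data, chosen only to illustrate the call.
sample = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
root = build_hierarchy(sample, fanout=3, min_size=4)
```

Note that this sketch cuts purely by position in the sorted order, so equal values can fall into adjacent groups. A fuller solution would need to decide how ties at a boundary are handled.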
3.14 Robust data loading poses a challenge in database systems because the input data are
often dirty. In many cases, an input record may miss multiple values; some records
could be contaminated, with some data values out of range or of a different data type
than expected. Work out an automated data cleaning and loading algorithm so that the
erroneous data will be marked and contaminated data will not be mistakenly inserted
into the database during data loading.
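
One possible outline for this exercise is sketched below in Python. It is a sketch under stated assumptions rather than a complete answer: each record is assumed to arrive as a dictionary of strings (as `csv.DictReader` would produce), the schema is a hypothetical mapping from field name to a `FieldRule` holding a type converter and an optional numeric range, and the `accepted` list stands in for the actual database insert. Records with any missing, mistyped, or out-of-range value are marked with their reasons and kept out of the accepted set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class FieldRule:
    """Hypothetical per-field rule: a type converter plus optional bounds."""
    cast: Callable[[str], Any]
    low: Optional[float] = None
    high: Optional[float] = None


def check_record(
    record: Dict[str, Optional[str]], schema: Dict[str, FieldRule]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return the converted values and a list of problems found in the record."""
    clean: Dict[str, Any] = {}
    errors: List[str] = []
    for name, rule in schema.items():
        raw = record.get(name)
        if raw is None or raw.strip() == "":
            errors.append(f"{name}: missing")
            continue
        try:
            value = rule.cast(raw.strip())
        except (ValueError, TypeError):
            errors.append(f"{name}: wrong type ({raw!r})")
            continue
        below = rule.low is not None and value < rule.low
        above = rule.high is not None and value > rule.high
        if below or above:
            errors.append(f"{name}: out of range ({value})")
            continue
        clean[name] = value
    return clean, errors


def load(records: List[Dict[str, Optional[str]]], schema: Dict[str, FieldRule]):
    """Split records into accepted rows and rejected rows marked with reasons."""
    accepted, rejected = [], []
    for line_no, record in enumerate(records, start=1):
        clean, errors = check_record(record, schema)
        if errors:
            rejected.append((line_no, record, errors))  # marked, never inserted
        else:
            accepted.append(clean)  # stand-in for the database insert
    return accepted, rejected


# Hypothetical schema and input, for illustration only.
schema = {"age": FieldRule(int, 0, 120), "name": FieldRule(str)}
records = [
    {"age": "34", "name": "Ann"},
    {"age": "-5", "name": "Bob"},
    {"age": "abc", "name": "Cy"},
    {"age": "40", "name": ""},
]
accepted, rejected = load(records, schema)
```

Collecting all problems per record, instead of stopping at the first one, lets the rejected rows be reviewed and repaired in a single pass. Whether a record with a missing value should be rejected or loaded with a null marker is a design decision that this sketch does not settle.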

3.8 Bibliographic Notes

Data preprocessing is discussed in a number of textbooks, including English [Eng99],
Pyle [Pyl99], Loshin [Los01], Redman [Red01], and Dasu and Johnson [DJ03]. More
specific references to individual preprocessing techniques are given later.
For discussion regarding data quality, see Redman [Red92]; Wang, Storey, and
Firth [WSF95]; Wand and Wang [WW96]; Ballou and Tayi [BT99]; and Olson [Ols03].
Potter’s Wheel (control.cs.berkeley.edu/abc), the interactive data cleaning tool described in
Section 3.2.3, is presented in Raman and Hellerstein [RH01]. An example of the development
of declarative languages for the specification of data transformation operators is
given in Galhardas et al. [GFS+01]. The handling of missing attribute values is discussed
in Friedman [Fri77]; Breiman, Friedman, Olshen, and Stone [BFOS84]; and Quinlan
[Qui89]. Hua and Pei [HP07] presented a heuristic approach to cleaning disguised missing
data, where such data are captured when users falsely select default values on forms
(e.g., “January 1” for birthdate) when they do not want to disclose personal information.
A method for the detection of outlier or “garbage” patterns in a handwritten character
database is given in Guyon, Matic, and Vapnik [GMV96]. Binning and data
normalization are treated in many texts, including Kennedy et al. [KLV+98], Weiss
and Indurkhya [WI98], and Pyle [Pyl99]. Systems that include attribute (or feature)
construction include BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87];
Stagger by Schlimmer [Sch86]; FRINGE by Pagallo [Pag89]; and AQ17-DCI by Bloedorn
and Michalski [BM98]. Attribute construction is also described in Liu and Motoda
[LM98a, LM98b]. Dasu et al. built a BELLMAN system and proposed a set of interesting
methods for building a data quality browser by mining database structures [DJMS02].
A good survey of data reduction techniques can be found in Barbará et al. [BDF+97].
For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker
[SS94]; Agarwal et al. [AAD+96]; Harinarayan, Rajaraman, and Ullman [HRU96]; Ross
and Srivastava [RS97]; and Zhao, Deshpande, and Naughton [ZDN97]. Attribute subset
selection (or feature subset selection) is described in many texts such as Neter, Kutner,
Nachtsheim, and Wasserman [NKNW96]; Dash and Liu [DL97]; and Liu and Motoda
[LM98a, LM98b]. A combination forward selection and backward elimination method
