Chapter 3 Data Preprocessing

3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:
(a) Normalize the two attributes based on z-score normalization.
(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
these two attributes positively or negatively correlated? Compute their covariance.
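
As a starting point for Exercise 3.8, the following Python sketch shows one way the three quantities might be computed. It assumes the population forms of the formulas (division by n rather than n - 1). The names z_scores, covariance, and pearson are illustrative, and the sample values are hypothetical placeholders, not the data of Exercise 2.4.

```python
import math

def z_scores(values):
    """Z-score normalization: (v - mean) / standard deviation (population form)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Pearson's coefficient: covariance divided by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical placeholder values; substitute the age and body fat data.
age = [20, 30, 40, 50, 60]
fat = [10.0, 18.0, 25.0, 30.0, 37.0]

print(z_scores(age))
print(covariance(age, fat), pearson(age, fat))
```

A coefficient above 0 indicates positive correlation and one below 0 indicates negative correlation; the covariance carries the same sign.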
3.9 Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
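
Parts (a) and (b) of Exercise 3.9 can be checked with a short Python sketch such as the one below. The function names are illustrative. The equal-width version assumes half-open intervals with the maximum value placed in the last bin. Part (c) is left out, since the result depends on the clustering method chosen.

```python
def equal_frequency_bins(sorted_values, k):
    """Split sorted values into k bins holding (nearly) the same number of values."""
    n = len(sorted_values)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [sorted_values[bounds[i]:bounds[i + 1]] for i in range(k)]

def equal_width_bins(sorted_values, k):
    """Split the range [min, max] into k intervals of equal width.

    Assumes max > min; the maximum value is assigned to the last bin.
    """
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted_values:
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

print(equal_frequency_bins(prices, 3))
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(equal_width_bins(prices, 3))
# [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```

With three bins the interval width is (215 - 5)/3 = 70, so the two outliers 204 and 215 leave the middle bin nearly empty. This is the usual weakness of equal-width partitioning on skewed data.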
3.10 Use a flowchart to summarize the following procedures for attribute subset selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
3.11 Using the data for age given in Exercise 3.3,
(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
cluster sampling, and stratified sampling. Use samples of size 5 and the strata
“youth,” “middle-aged,” and “senior.”
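
The four sampling techniques in part (b) can be illustrated with Python's standard random module, as in the sketch below. The age values and the stratum cutoffs (under 30, 30 to 59, 60 and over) are hypothetical placeholders, not the data of Exercise 3.3. The stratified version assumes allocation roughly proportional to stratum size, so rounding may make the sample size differ slightly from s on other data.

```python
import random

def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: each tuple drawn at most once."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn repeatedly."""
    return rng.choices(data, k=s)

def cluster_sample(data, cluster_size, m, rng):
    """Group tuples into consecutive clusters, then draw m whole clusters."""
    clusters = [data[i:i + cluster_size] for i in range(0, len(data), cluster_size)]
    return rng.sample(clusters, m)

def stratified_sample(data, stratum_of, s, rng):
    """Draw from every stratum, roughly in proportion to its size."""
    strata = {}
    for v in data:
        strata.setdefault(stratum_of(v), []).append(v)
    sample = []
    for members in strata.values():
        quota = max(1, round(s * len(members) / len(data)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

def age_stratum(age):
    # Hypothetical cutoffs chosen only for illustration.
    if age < 30:
        return "youth"
    return "middle-aged" if age < 60 else "senior"

# Hypothetical ages; substitute the data of Exercise 3.3.
ages = [13, 16, 19, 22, 25, 30, 33, 36, 40, 45, 52, 58, 65, 70, 74]
rng = random.Random(0)  # fixed seed so that runs are repeatable

print(srswor(ages, 5, rng))
print(srswr(ages, 5, rng))
print(cluster_sample(ages, 5, 1, rng))
print(stratified_sample(ages, age_stratum, 5, rng))
```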

3.12 ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization
method. It relies on χ² analysis: Adjacent intervals with the least χ² values are merged
together until the chosen stopping criterion is satisfied.
(a) Briefly describe how ChiMerge works.
(b) Take the IRIS data set, obtained from the University of California–Irvine Machine
Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html), as a data
set to be discretized. Perform data discretization for each of the four numeric
attributes using the ChiMerge method. (Let the stopping criterion be: max-interval
= 6.) You need to write a small program to do this to avoid clumsy numerical
computation. Submit your simple analysis and your test results: split-points, final
intervals, and the documented source program.
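
The merge loop at the heart of ChiMerge might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a complete solution: it uses only the max-interval stopping criterion, skips cells whose expected count is 0 (a simplification), and runs on a tiny hypothetical attribute rather than the IRIS data. The names chi2 and chimerge are illustrative.

```python
from collections import Counter

def chi2(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    stat = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for c in classes:
            col_total = counts_a[c] + counts_b[c]
            expected = row_total * col_total / total
            if expected > 0:  # simplification: skip cells with expected count 0
                stat += (row[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, max_intervals):
    """Return the lower bounds of the final intervals."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value, holding its class counts.
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    intervals = [(v, by_value[v]) for v in sorted(by_value)]
    # Repeatedly merge the adjacent pair with the smallest chi-square value.
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]

# Hypothetical toy attribute with two classes.
values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
print(chimerge(values, labels, 2))  # [1, 4]
```

A low χ² value means the class distributions of the two intervals are similar, so merging them loses little class information. In the toy data the only boundary that survives is the one between the two classes.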
3.13 Propose an algorithm, in pseudocode or in your favorite programming language, for the
following:
(a) The automatic generation of a concept hierarchy for nominal data based on the
number of distinct values of attributes in the given schema.
(b) The automatic generation of a concept hierarchy for numeric data based on the
equal-width partitioning rule.
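
One possible shape for an answer to Exercise 3.13 is sketched below in Python. For part (a) it assumes the heuristic that an attribute with fewer distinct values sits higher in the hierarchy. For part (b) it assumes a fixed fan-out per level and a fixed depth, both hypothetical parameters introduced here for illustration. The function names and the toy table are likewise hypothetical.

```python
def nominal_hierarchy(table):
    """Order attributes from the top level (fewest distinct values) to the bottom.

    table maps each attribute name to its list of values.
    """
    return sorted(table, key=lambda attribute: len(set(table[attribute])))

def equal_width_hierarchy(lo, hi, fanout, depth):
    """Recursively split [lo, hi) into `fanout` equal-width child intervals."""
    node = {"range": (lo, hi), "children": []}
    if depth > 0:
        width = (hi - lo) / fanout
        for i in range(fanout):
            child = equal_width_hierarchy(lo + i * width, lo + (i + 1) * width,
                                          fanout, depth - 1)
            node["children"].append(child)
    return node

# Hypothetical toy table of location attributes.
table = {
    "street": ["Main St", "Oak Ave", "Pine Rd", "Elm St"],
    "city": ["Vancouver", "Vancouver", "Chicago", "Urbana"],
    "country": ["Canada", "Canada", "USA", "USA"],
}
print(nominal_hierarchy(table))  # ['country', 'city', 'street']

tree = equal_width_hierarchy(0, 100, 2, 2)
print([child["range"] for child in tree["children"]])  # [(0.0, 50.0), (50.0, 100.0)]
```

The distinct-value heuristic can fail. For example, a weekday attribute has only 7 distinct values yet does not belong above month or year, so a generated hierarchy should be reviewed before use.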
