

          Chapter 3 Data Preprocessing



                     3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:
                         (a) Normalize the two attributes based on z-score normalization.
                         (b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
                            these two attributes positively or negatively correlated? Compute their covariance.
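The quantities named in Exercise 3.8 can be sketched in a few lines of Python. This is a minimal illustration, not a worked solution: it assumes the population form of the standard deviation and covariance (dividing by n), and the lists `xs` and `ys` are hypothetical placeholders rather than the Exercise 2.4 data.

```python
import math

def z_scores(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def covariance(a, b):
    """Population covariance of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def pearson(a, b):
    """Pearson's product moment coefficient: covariance over the product of std devs."""
    std_a = math.sqrt(covariance(a, a))
    std_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (std_a * std_b)

# Hypothetical placeholder values, NOT the age / body fat data of Exercise 2.4.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(z_scores(xs))
print(covariance(xs, ys), pearson(xs, ys))
```

A positive coefficient indicates positive correlation and a negative one indicates negative correlation; the covariance always carries the same sign as the coefficient.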
                     3.9 Suppose a group of 12 sales price records has been sorted as follows:
                                           5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
                         Partition them into three bins by each of the following methods:
                         (a) equal-frequency (equal-depth) partitioning
                         (b) equal-width partitioning
                         (c) clustering
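Parts (a) and (b) of Exercise 3.9 are mechanical enough to sketch in code. The function names below are hypothetical, and two conventions are assumed: the number of values is divisible by the number of bins, and the maximum value is placed in the last equal-width bin. Clustering, part (c), is not covered.

```python
def equal_frequency_bins(sorted_values, k):
    """Split sorted values into k bins that each hold the same number of values."""
    size = len(sorted_values) // k  # assumes len(sorted_values) is divisible by k
    return [sorted_values[i * size:(i + 1) * size] for i in range(k)]

def equal_width_bins(sorted_values, k):
    """Split the value range into k intervals of equal width and assign each value."""
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted_values:
        # The maximum would fall just past the last interval, so clamp it.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_frequency_bins(prices, 3))
print(equal_width_bins(prices, 3))
```

With these conventions the equal-width interval is (215 - 5) / 3 = 70 wide, so most of the records land in the first bin, which illustrates how sensitive equal-width partitioning is to outliers.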
                    3.10 Use a flowchart to summarize the following procedures for attribute subset selection:
                         (a) stepwise forward selection
                         (b) stepwise backward elimination
                         (c) a combination of forward selection and backward elimination
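Before drawing the flowchart for part (a) of Exercise 3.10, it may help to see the greedy loop written out. The sketch below assumes a caller-supplied `score` function (hypothetical; for example, a cross-validated accuracy or a statistical significance measure) where higher is better, and it stops as soon as no remaining attribute improves the score.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection.

    `score` is a hypothetical function mapping a list of attributes to a
    number; higher means a better attribute subset.
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute and keep the best candidate.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute improves the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal gives the best score.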
                    3.11 Using the data for age given in Exercise 3.3,
                         (a) Plot an equal-width histogram of width 10.
                         (b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
                            cluster sampling, and stratified sampling. Use samples of size 5 and the strata
                            “youth,” “middle-aged,” and “senior.”
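Three of the sampling techniques in Exercise 3.11(b) can be illustrated with Python's standard `random` module. Everything specific here is an assumption made for illustration: the `ages` list is a placeholder and not the Exercise 3.3 data, the cutoffs in `age_stratum` are hypothetical, and the stratified sampler draws a fixed number of tuples per stratum rather than a proportional allocation. Cluster sampling is omitted.

```python
import random

def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: each tuple is drawn at most once."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn more than once."""
    return [rng.choice(data) for _ in range(s)]

def stratified(data, stratum_of, per_stratum, rng):
    """Draw an SRSWOR of up to `per_stratum` tuples from every stratum."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(items, min(per_stratum, len(items)))
            for name, items in strata.items()}

def age_stratum(age):
    """Hypothetical cutoffs; the exercise does not define the strata boundaries."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

rng = random.Random(0)            # fixed seed so repeated runs give the same draw
ages = list(range(15, 75, 5))     # placeholder ages, NOT the Exercise 3.3 data
print(srswor(ages, 5, rng))
print(srswr(ages, 5, rng))
print(stratified(ages, age_stratum, 2, rng))
```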

                    3.12 ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization
                         method. It relies on χ² analysis: Adjacent intervals with the least χ² values are merged
                         together until the chosen stopping criterion is satisfied.
                         (a) Briefly describe how ChiMerge works.
                         (b) Take the IRIS data set, obtained from the University of California–Irvine Machine
                            Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html), as a data
                            set to be discretized. Perform data discretization for each of the four numeric
                            attributes using the ChiMerge method. (Let the stopping criterion be max-interval
                            = 6.) You need to write a small program to do this to avoid clumsy numerical
                            computation. Submit your simple analysis and your test results: split-points, final
                            intervals, and the documented source program.
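The core step of the program requested in Exercise 3.12 is the χ² computation for a pair of adjacent intervals, followed by merging the pair with the smallest value. The sketch below is one possible starting point under stated assumptions: each interval is represented as a list of per-class tuple counts, the function names are hypothetical, and a class absent from both intervals is simply skipped, which is a simplification and may differ from the treatment in [Ker92].

```python
def chi2_adjacent(counts_a, counts_b):
    """Pearson chi-square statistic for the 2 x k table formed by the
    per-class counts of two adjacent intervals."""
    rows = [counts_a, counts_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # simplification: skip classes absent from both intervals
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def merge_lowest(intervals):
    """Merge the adjacent pair of intervals with the smallest chi-square value.

    `intervals` is a list of per-class count lists, ordered by attribute value.
    """
    scores = [chi2_adjacent(intervals[i], intervals[i + 1])
              for i in range(len(intervals) - 1)]
    i = scores.index(min(scores))
    merged = [a + b for a, b in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]
```

Starting from one interval per distinct attribute value, calling `merge_lowest` repeatedly while more than six intervals remain would implement the max-interval = 6 stopping criterion; tracking the interval boundaries alongside the counts is left to the full program.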
                    3.13 Propose an algorithm, in pseudocode or in your favorite programming language, for the
                         following:
                         (a) The automatic generation of a concept hierarchy for nominal data based on the
                            number of distinct values of attributes in the given schema.
                         (b) The automatic generation of a concept hierarchy for numeric data based on the
                            equal-width partitioning rule.
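As a starting point for Exercise 3.13, the two building blocks can be sketched as follows. The function names are hypothetical, and the sketch assumes the usual heuristic that an attribute with fewer distinct values sits higher in the hierarchy. It produces a single level of numeric intervals only; a full hierarchy would apply the partitioning recursively to each interval.

```python
def hierarchy_by_distinct_values(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct values)
    to the bottom (most distinct values).

    `distinct_counts` maps each attribute name to its number of distinct values.
    """
    return sorted(distinct_counts, key=distinct_counts.get)

def equal_width_intervals(lo, hi, k):
    """Split the range [lo, hi] into k intervals of equal width."""
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# Hypothetical distinct-value counts, for illustration only.
print(hierarchy_by_distinct_values({"street": 100, "country": 2, "city": 20}))
print(equal_width_intervals(0, 100, 4))
```

The distinct-value heuristic is not foolproof: an attribute can have few distinct values and still belong low in the hierarchy, so generated orderings should be reviewed.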