Page 361 - Computational Statistics Handbook with MATLAB
P. 361

350                        Computational Statistics Handbook with MATLAB

                             p ω j t(  )   given by Equation 9.14. So, using Bayes decision theory, we would
                                                                         that has the highest poste-
                             classify an observation at node t with the class ω j
                             rior probability. The error in our classification is then given by Equation 9.15.
                             We summarize the steps for growing a classification tree in the following pro-
                             cedure. In the learning set, each observation will be a row in the matrix X, so
                             this matrix has dimensionality n ×  ( d +  1)  , representing d features and a class
                             label. The measured value of the k-th feature for the i-th observation is
                                          .
                             denoted by x ik

                             PROCEDURE - GROWING A TREE

                                                                                    that will be
                                1. Determine the maximum number of observations  n max
                                   allowed in a terminal node.
                                                                                       . These
                                2. Determine the prior probabilities of class membership  π j
                                   can be estimated from the data (Equation 9.11), or they can be based
                                   on prior knowledge of the application.
                                3. If a terminal node in the current tree contains more than the max-
                                   imum allowed observations and contains observations from sev-
                                   eral classes, then search for the best split. For each feature k,
                                                                                            .
                                   a. Put the  x ik   in ascending order to give the ordered values  x i()k
                                                           in the k-th feature using
                                   b. Determine all splits  s i()k
                                                  s i()k =  x i()k + ( x i()k –  x i +(  1)k ) 2⁄

                                   c. For each proposed split, evaluate the impurity function  it()  and
                                     the goodness of the split using Equations 9.20 and 9.18.
                                   d. Pick the best, which is the one that yields the largest decrease
                                     in impurity.
                                4. Out of the k best splits in step 3, split the node on the variable that
                                   yields the best overall split.
                                5. For that split found in step 4, determine the observations that go
                                   to the left child and those that go to the right child.
                                6. Repeat steps 3 through 5 until each terminal node satisfies the
                                   stopping rule (has  observations from only one class or has the
                                   maximum allowed cases in the node).


                             Example 9.11
                             In this example, we grow the initial large tree on the data set given in the pre-
                             vious example. We stop growing the tree when each terminal node has a
                             maximum of 5 observations or the node is pure. We first load the data that we
                             generated in the previous example. This file contains the data matrix, the
                             inputs to the function csgrowc, and the resulting tree.

                            © 2002 by Chapman & Hall/CRC
   356   357   358   359   360   361   362   363   364   365   366