To grow a tree, we need some criterion to help us decide how to split the nodes. We also need a rule that tells us when to stop splitting the nodes, at which point we are finished growing the tree. The stopping rule can be quite simple, since we first grow an overly large tree. One possible choice is to continue splitting terminal nodes until each one contains observations from a single class, in which case some terminal nodes might contain only one observation. Another option is to continue splitting nodes until each terminal node is pure (all observations belong to one class) or contains no more than some maximum number of observations. Recommended values for the maximum number of observations left in a terminal node are between 1 and 5.
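
To make the stopping rule concrete, a minimal MATLAB sketch of such a check might look like the following; the function name stopsplit and its arguments are illustrative assumptions and are not part of the csgrowc code discussed below.

function stop = stopsplit(classLabels, maxn)
% STOPSPLIT  Return true if a node should not be split further.
%   classLabels - class labels of the observations in the node
%   maxn        - stop splitting once the node has maxn or fewer
%                 observations (values between 1 and 5 are recommended)
% The node becomes terminal when it is small enough or when it is pure
% (all of its observations belong to a single class).
stop = numel(classLabels) <= maxn || numel(unique(classLabels)) == 1;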
We now discuss the splitting rule in more detail. When we split a node, our goal is to find a split that reduces the impurity in some manner. So, we need a measure of impurity i(t) for a node t. Breiman et al. [1984] discuss several possibilities, one of which is called the Gini diversity index. This is the one we will use in our implementation of classification trees. The Gini index is given by
    i(t) = \sum_{i \neq j} p(\omega_i \mid t) \, p(\omega_j \mid t) ,                (9.19)

                             which can also be written as

    i(t) = 1 - \sum_{j=1}^{J} p^2(\omega_j \mid t) .                (9.20)
                             Equation 9.20 is the one we code in the MATLAB function csgrowc for
                             growing classification trees.
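
As an illustration of Equation 9.20, the node impurity can be computed from the relative frequencies of the classes in the node. The following is a minimal sketch, in which the function name gini_impurity is hypothetical and the probabilities p(ω_j | t) are simply estimated by the within-node relative frequencies; it is not the csgrowc implementation itself.

function imp = gini_impurity(classLabels)
% GINI_IMPURITY  Gini diversity index of a node (Equation 9.20).
%   classLabels - class labels of the observations falling into the node
% The probabilities p(omega_j | t) are estimated here by the relative
% frequency of each class within the node.
[~, ~, ic] = unique(classLabels);      % map labels to integers 1,...,J
p = accumarray(ic, 1) / numel(ic);     % relative frequency of each class
imp = 1 - sum(p.^2);                   % i(t) = 1 - sum_j p(omega_j|t)^2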
Before continuing with our description of the splitting process, we first note that our use of the term 'best' does not necessarily mean that the split we find is the optimal one out of the infinitely many possible splits. To grow a tree at a given node, we search for the best split (in terms of decreasing the node impurity) by first searching through each variable or feature. We have d possible best splits for a node (one for each feature), and we choose the best one out of these d splits. The problem now is that there are infinitely many possible splits for each feature. We can limit our search by using the following convention. For all feature vectors in our learning sample, we search for the best split on the k-th feature by proposing splits that are halfway between consecutive sorted values of that feature. For each proposed split, we evaluate the impurity criterion and choose the split that yields the largest decrease in impurity.
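
A sketch of this search for a single feature is given below; candidate splits are placed halfway between consecutive sorted feature values and scored by the decrease in the Gini impurity. The function name bestsplit1d and its use of the gini_impurity sketch above are illustrative assumptions, not the code of csgrowc.

function [bestSplit, bestDecrease] = bestsplit1d(x, classLabels)
% BESTSPLIT1D  Best split of one feature, chosen by decrease in impurity.
%   x           - values of a single feature for observations in the node
%   classLabels - the corresponding class labels
vals = unique(x(:));                        % sorted distinct feature values
cands = (vals(1:end-1) + vals(2:end))/2;    % halfway between consecutive values
n = numel(x);
parentImp = gini_impurity(classLabels);     % impurity of the parent node
bestDecrease = -Inf;
bestSplit = NaN;
for s = cands'                              % examine each candidate split
    left = x <= s;                          % observations sent to the left child
    % impurity of the children, weighted by the proportion in each child
    childImp = sum(left)/n*gini_impurity(classLabels(left)) + ...
               sum(~left)/n*gini_impurity(classLabels(~left));
    decrease = parentImp - childImp;        % decrease in impurity for this split
    if decrease > bestDecrease
        bestDecrease = decrease;
        bestSplit = s;
    end
end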
Once we have finished growing our tree, we must assign class labels to the terminal nodes and determine the corresponding misclassification rate. It makes sense to assign the class label to a node according to the likelihood that it is in class ω_j given that it fell into node t. This is the posterior probability