Page 363 - Computational Statistics Handbook with MATLAB
P. 363

352                        Computational Statistics Handbook with MATLAB



                                         e
                                         reee
                              uning
                              uning
                             Pr PPrr uningthethe  Tr TTrr eeee
                             PruningthetheT
                             Recall that the classification error for a node is given by Equation 9.15. If we
                             grow a tree until each terminal node contains observations from only one
                             class, then the error rate will be zero. Therefore, if we use the classification
                             error as a stopping criterion or as a measure of when we have a good tree,
                             then we would grow the tree until there are pure nodes. However, as we men-
                             tioned before, this procedure over fits the data and the classification tree will
                             not generalize well to new patterns. The suggestion made in Breiman, et al.
                                                                              , and then to find a
                             [1984] is to grow an overly large tree, denoted by  T max
                             nested sequence of subtrees by successively pruning branches of the tree. The
                             best tree from this sequence is chosen based on the misclassification rate esti-
                             mated by cross-validation or an independent test sample. We describe the
                             two approaches after we discuss how to prune the tree.
                              The pruning procedure uses the misclassification rates along with a cost for
                             the complexity of the tree. The complexity of the tree is based on the number
                             of terminal nodes in a subtree or branch. The cost complexity measure is
                             defined as

                                                                 )
                                               R α T() =  RT() +  α T ;  α ≥  . 0          (9.21)

                             We look for a tree that minimizes the cost complexity given by Equation 9.21.
                                 α
                             The   is a parameter that represents the complexity cost per terminal node.
                             If we have a large tree where every terminal node contains observations from
                             only one class, then RT()   will be zero. However, there will be a penalty paid
                             because of the complexity, and the cost complexity measure becomes

                                                                   )
                                                        R α T() =  α T  .
                             If  α   is small, then the penalty for having a complex tree is small, and the
                             resulting tree is large. The tree that minimizes  R α T()   will tend to have few
                                           α
                             nodes and large  .
                              Before we go further with our explanation of the pruning procedure, we
                                                                                          of a tree
                             need to define what we mean by the branches of a tree. A branch T t
                             T consists of the node   and all its descendent nodes. When we prune or
                                                  t
                             delete this branch, then we remove all descendent nodes of  , leaving the
                                                                                    t
                                            t
                             branch root node  . For example, using the tree in Figure 9.10, the branch cor-
                             responding to node 3 contains nodes 3, 4, 5, 6, and 7, as shown in Figure 9.13.
                             If we delete that branch, then the remaining nodes are 1, 2, and 3.
                              Minimal complexity pruning searches for the branches that have the weak-
                             est link, which we then delete from the tree. The pruning process produces a
                             sequence of subtrees with fewer terminal nodes and decreasing complexity.
                                                                                        . We are
                              We start with our overly large tree and denote this tree as  T max
                             searching for a finite sequence of subtrees such that

                            © 2002 by Chapman & Hall/CRC
   358   359   360   361   362   363   364   365   366   367   368