To grow a tree, we need some criterion to help us decide how to split the nodes. We also need a rule that tells us when to stop splitting the nodes, at which point we are finished growing the tree. The stopping rule can be quite simple, since we first grow an overly large tree. One possible choice is to continue splitting terminal nodes until each one contains observations from a single class, in which case some terminal nodes might contain only one observation. Another option is to continue splitting nodes until each terminal node is pure (all observations belong to one class) or contains no more than some maximum number of observations. Recommended values for the maximum number of observations left in a terminal node are between 1 and 5.
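
To make the stopping rule concrete, a minimal MATLAB sketch of such a check might look like the following; the function name stopsplit and its arguments are illustrative assumptions and are not part of the csgrowc code discussed below.

function stop = stopsplit(classLabels, maxn)
% STOPSPLIT  Return true if a node should not be split further.
%   classLabels - class labels of the observations in the node
%   maxn        - stop splitting once the node has maxn or fewer
%                 observations (values between 1 and 5 are recommended)
% The node becomes terminal when it is small enough or when it is pure
% (all of its observations belong to a single class).
stop = numel(classLabels) <= maxn || numel(unique(classLabels)) == 1;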
We now discuss the splitting rule in more detail. When we split a node, our goal is to find a split that reduces the impurity in some manner. So, we need a measure of impurity i(t) for a node t. Breiman et al. [1984] discuss several possibilities, one of which is called the Gini diversity index. This is the one we will use in our implementation of classification trees. The Gini index is given by
    i(t) = \sum_{i \neq j} p(\omega_i \mid t) \, p(\omega_j \mid t) ,                (9.19)

                             which can also be written as

    i(t) = 1 - \sum_{j=1}^{J} p^2(\omega_j \mid t) .                (9.20)
                             Equation 9.20 is the one we code in the MATLAB function csgrowc for
                             growing classification trees.
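
As an illustration of Equation 9.20, the node impurity can be computed from the relative frequencies of the classes in the node. The following is a minimal sketch, in which the function name gini_impurity is hypothetical and the probabilities p(ω_j | t) are simply estimated by the within-node relative frequencies; it is not the csgrowc implementation itself.

function imp = gini_impurity(classLabels)
% GINI_IMPURITY  Gini diversity index of a node (Equation 9.20).
%   classLabels - class labels of the observations falling into the node
% The probabilities p(omega_j | t) are estimated here by the relative
% frequency of each class within the node.
[~, ~, ic] = unique(classLabels);      % map labels to integers 1,...,J
p = accumarray(ic, 1) / numel(ic);     % relative frequency of each class
imp = 1 - sum(p.^2);                   % i(t) = 1 - sum_j p(omega_j|t)^2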
Before continuing with our description of the splitting process, we first note that our use of the term 'best' does not necessarily mean that the split we find is the optimal one out of the infinitely many possible splits. To grow a tree at a given node, we search for the best split (in terms of decreasing the node impurity) by first searching through each variable or feature. We have d possible best splits for a node (one for each feature), and we choose the best one out of these d splits. The problem now is that there are infinitely many possible splits for each feature. We can limit our search by using the following convention. For all feature vectors in our learning sample, we search for the best split on the k-th feature by proposing splits that are halfway between consecutive sorted values of that feature. For each proposed split, we evaluate the impurity criterion and choose the split that yields the largest decrease in impurity.
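
A sketch of this search for a single feature is given below; candidate splits are placed halfway between consecutive sorted feature values and scored by the decrease in the Gini impurity. The function name bestsplit1d and its use of the gini_impurity sketch above are illustrative assumptions, not the code of csgrowc.

function [bestSplit, bestDecrease] = bestsplit1d(x, classLabels)
% BESTSPLIT1D  Best split of one feature, chosen by decrease in impurity.
%   x           - values of a single feature for observations in the node
%   classLabels - the corresponding class labels
vals = unique(x(:));                        % sorted distinct feature values
cands = (vals(1:end-1) + vals(2:end))/2;    % halfway between consecutive values
n = numel(x);
parentImp = gini_impurity(classLabels);     % impurity of the parent node
bestDecrease = -Inf;
bestSplit = NaN;
for s = cands'                              % examine each candidate split
    left = x <= s;                          % observations sent to the left child
    % impurity of the children, weighted by the proportion in each child
    childImp = sum(left)/n*gini_impurity(classLabels(left)) + ...
               sum(~left)/n*gini_impurity(classLabels(~left));
    decrease = parentImp - childImp;        % decrease in impurity for this split
    if decrease > bestDecrease
        bestDecrease = decrease;
        bestSplit = s;
    end
end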
Once we have finished growing our tree, we must assign class labels to the terminal nodes and determine the corresponding misclassification rate. It makes sense to assign the class label to a node according to the likelihood that it is in class ω_j given that it fell into node t. This is the posterior probability