Page 282 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

6.7 Tree Classifiers

or not a given case belongs to a set of categories. For instance, this type of tree is
frequently used in medical applications, and is often built as a result of statistical
studies of the influence of individual health factors in a given population.
The design of decision trees can be automated in many ways, depending on the
split criterion used at each node, and the type of search used for best group
discrimination. A split criterion has the form:

d(x) ≥ ∆,

where d(x) is a decision function of the feature vector x and ∆ is a threshold.
Usually, linear decision functions are used. In many applications, the split criteria
are expressed in terms of the individual features alone (the so-called univariate
splits).
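
As an illustration of the split criterion d(x) ≥ ∆, the following minimal Python sketch applies a linear decision function to a few cases. The function names, the data values, the weights and the thresholds are all hypothetical, chosen only to contrast a univariate split with a split that combines two features.

```python
def decision_value(x, w):
    """Linear decision function d(x) = w . x for one feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def apply_split(cases, w, delta):
    """Return one boolean per case: True where d(x) >= delta."""
    return [decision_value(x, w) >= delta for x in cases]


# Hypothetical cases, each with two features (x1, x2).
cases = [(0.2, 0.9), (0.8, 0.1), (0.6, 0.7), (0.3, 0.2)]

# Univariate split on x1 alone: the weight of x2 is zero.
univariate = apply_split(cases, w=(1.0, 0.0), delta=0.5)

# Linear split that combines both features.
multivariate = apply_split(cases, w=(1.0, 1.0), delta=1.0)

print(univariate)    # [False, True, True, False]
print(multivariate)  # [True, False, True, False]
```

A univariate split is thus the special case of a linear split in which all weights but one are zero.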
An important concept regarding split criteria is node impurity. The node
impurity is a function of the fractions of cases belonging to each class at that
node.
Consider the two-class situation shown in Figure 6.27. Initially, we have a node
with equal proportions of cases belonging to the two classes (white and black
circles). We say that its impurity is maximal. The right split results in nodes with
zero impurity, since they contain cases from only one of the classes. The left split,
by contrast, increases the proportion of cases from one of the classes, thereby
decreasing the impurity, although some impurity remains.

[Figure 6.27: plot over features x₁ and x₂, with node labels t₁, t₂, t₁₁, t₁₂, t₂₁, t₂₂.]

Figure 6.27. Splitting a node with maximum impurity. The left split (x₁ ≥ ∆)
decreases the impurity, which is still non-zero; the right split (w₁x₁ + w₂x₂ ≥ ∆)
achieves pure nodes.

A popular measure of impurity, expressed in the [0, 1] interval, is the Gini index
of diversity:

$$
i(t) = \sum_{\substack{j,k=1 \\ j \neq k}}^{c} P(j \mid t)\, P(k \mid t). \tag{6.30}
$$

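
The Gini index can be computed directly from the class labels of the cases at a node. The sketch below is a minimal Python illustration, assuming that P(j | t) is estimated by the fraction of cases of class j at node t; the function name and the example nodes are hypothetical and are not taken from Figure 6.27.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node, given the class labels of its cases."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node must contain at least one case")
    # P(j | t): fraction of the node's cases that belong to class j.
    p = [count / n for count in Counter(labels).values()]
    # Sum of P(j | t) * P(k | t) over all ordered pairs with j != k.
    return sum(p[j] * p[k]
               for j in range(len(p))
               for k in range(len(p))
               if j != k)


# Hypothetical two-class nodes.
mixed = ["white"] * 5 + ["black"] * 5   # equal proportions
pure = ["white"] * 4                    # a single class

print(gini_index(mixed))  # 0.5
print(gini_index(pure))   # 0
```

Since the class fractions at a node sum to one, the double sum equals one minus the sum of the squared fractions, so a pure node always yields zero.
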
For the situation shown in Figure 6.27, we have:
