4.6 Tree Classifiers
4.6.1 Decision Trees and Tables
In multi-class classification tasks one is often confronted with the problem that a
reasonable performance can only be achieved through the use of a large number
of features. This requires a very large design set for proper training, probably much
larger than the one available. Also, the feature subset that is most
discriminating for some classes can perform rather poorly for other classes. As
an attempt to overcome these difficulties, a "divide and conquer" principle through
the use of multistage classification has been proposed: the so-called decision tree
classifiers, also known as hierarchical classifiers, where an unknown pattern is
classified into a class using decision functions in successive stages.
At each stage of the tree classifier a simpler problem with a smaller number of
features is solved. This has an additional benefit: in practical multi-class problems
it is rather difficult to guarantee normal or even symmetric distributions with
similar covariance matrices for all the classes, but it may be possible that by using
a multistage approach these conditions are approximately met, affording optimal
classifiers at each stage.
Figure 4.38 shows an example of a simple decision tree with two levels of
classification, l₁ and l₂. At the first level, l₁, a set of four classes Ω = {ω₁, ω₂, ω₃,
ω₄} is available for classification. However, instead of allowing the assignment of
pattern x to any of the four classes, we limit the assignment to a dichotomic
partition of Ω, denoted as Ω(l₁) = {ω₂, Ω(l₂)}. At level l₂, the classifier then has to
decide to which class of Ω(l₂) the unknown pattern x belongs, with Ω(l₂) = {ω₁, ω₃,
ω₄}.
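As a concrete illustration (not part of the original example), the following Python sketch chains two stage classifiers in this fashion: the first separates ω₂ from the group Ω(l₂) using one feature subset, the second discriminates among ω₁, ω₃ and ω₄ using another. The choice of linear discriminants, the feature indices and the integer class labels are assumptions made for the sketch only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

FEATURES_L1 = [0, 1]      # assumed subset used at level l1 (w2 vs. the group)
FEATURES_L2 = [2, 3, 4]   # assumed subset used at level l2 (w1 vs. w3 vs. w4)

class TwoLevelTreeClassifier:
    """Minimal two-level tree classifier sketch for the partition above."""

    def __init__(self):
        self.stage1 = LinearDiscriminantAnalysis()   # decides w2 vs. Omega(l2)
        self.stage2 = LinearDiscriminantAnalysis()   # decides among w1, w3, w4

    def fit(self, X, y):
        # Level l1: binary problem; label -1 stands for the group Omega(l2).
        y1 = np.where(y == 2, 2, -1)
        self.stage1.fit(X[:, FEATURES_L1], y1)
        # Level l2: trained only on patterns of classes w1, w3, w4.
        mask = y != 2
        self.stage2.fit(X[mask][:, FEATURES_L2], y[mask])
        return self

    def predict(self, X):
        first = self.stage1.predict(X[:, FEATURES_L1])
        labels = np.full(len(X), 2)
        group = first == -1                          # patterns routed to level l2
        if group.any():
            labels[group] = self.stage2.predict(X[group][:, FEATURES_L2])
        return labels

Each stage thus solves a simpler problem, with fewer classes and a smaller feature subset, which is precisely the motivation stated above.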
Let us study the tree classification performance following the work of Kulkarni
(1978), assuming for simplicity that the features used along a path of the tree, in
order to reach a certain node, have independent distributions. The pdf relative to a
class ωk can then be expressed as:

$$ p(\mathbf{x} \mid \omega_k) = \prod_{j=1}^{l} p(\mathbf{x}_j \mid \omega_k), $$

where the xⱼ are the subvectors corresponding to the l independent feature subsets used
to classify ωk along a certain tree path.
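Under this independence assumption the class-conditional likelihood can be evaluated subset by subset and then multiplied. The short Python sketch below does this for a hypothetical class ωk described by two Gaussian feature subvectors; the means, covariances and pattern values are illustration figures only.

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical per-subset Gaussian models for one class w_k:
# a 2-feature subvector used at one node and a 1-feature subvector at the next.
subset_models = [
    multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.2], [0.2, 1.0]]),
    multivariate_normal(mean=[3.0], cov=[[0.5]]),
]

def class_likelihood(subvectors, models):
    # p(x | w_k) as the product of the independent subvector pdfs
    return float(np.prod([m.pdf(x_j) for m, x_j in zip(models, subvectors)]))

x_subvectors = [np.array([0.3, 0.8]), np.array([2.7])]   # pattern split into subvectors
print(class_likelihood(x_subvectors, subset_models))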
Denoting the probability of a correct classification of ωk by Pc(ωk), with the
assumption that the feature subsets used in the classification are independent, one
has, in the same way: