Page 282 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

6.7 Tree Classifiers


           or not a given case belongs to a set of categories. For instance, this type of tree is
           frequently used in medical applications, and is often built as a result of statistical
           studies of the influence of individual health factors in a given population.
              The design of decision trees can be automated in many ways, depending on the
           split criterion used at each node and the type of search used for best group
           discrimination. A split criterion has the form:

              d(x) ≥ ∆,

           where d(x) is a decision function of the feature vector x and ∆ is a threshold.
           Usually, linear decision functions are used. In many applications, the split criteria
           are expressed in terms of the individual features alone (the so-called univariate
           splits).
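
The split criterion d(x) ≥ ∆ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `linear_split` and `univariate_split`, the weight vector `w`, and the sample values are hypothetical and do not come from the text. A univariate split is treated here as the special case of a linear decision function whose weight vector selects a single feature.

```python
def linear_split(x, w, delta):
    """Return True when the linear decision function d(x) = w . x satisfies d(x) >= delta."""
    d = sum(wi * xi for wi, xi in zip(w, x))
    return d >= delta


def univariate_split(x, feature, delta):
    """Split on one feature only: d(x) = x[feature]."""
    w = [0.0] * len(x)      # all weights zero ...
    w[feature] = 1.0        # ... except the selected feature
    return linear_split(x, w, delta)


if __name__ == "__main__":
    x = [1.0, 2.0]                              # hypothetical feature vector (x1, x2)
    print(linear_split(x, [0.5, 0.5], 1.5))     # d(x) = 1.5, so the criterion holds
    print(univariate_split(x, 0, 1.5))          # d(x) = x1 = 1.0, so it does not
```

Cases for which the function returns True would be sent to one child node and the remaining cases to the other.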
              An important concept regarding split criteria is that of node impurity. The
           node impurity is a function of the fraction of cases belonging to a specific
           class at that node.
              Consider the two-class situation shown in Figure 6.27. Initially, we have a node
           with equal proportions of cases belonging to the two classes (white and black
           circles). We say that its impurity is maximal. The right split results in nodes with
           zero impurity, since they contain cases from only one of the classes. The left split,
           on the contrary, increases the proportion of cases from one of the classes, therefore
           decreasing the impurity, although some impurity remains present.



           [Figure: axes x1 and x2; node labels t1, t2, t11, t12, t21, t22]


           Figure 6.27. Splitting a node with maximum impurity. The left split (x1 ≥ ∆)
           decreases the impurity, which is still non-zero; the right split (w1 x1 + w2 x2 ≥ ∆)
           achieves pure nodes.



              A popular measure of impurity, expressed in the [0, 1] interval, is the Gini index
           of diversity:

               i(t) = \sum_{j,k=1,\ j \neq k}^{c} P(j \mid t)\, P(k \mid t).                  (6.30)
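
As a minimal sketch of how the Gini index of a node could be computed from the class labels of its cases, consider the Python function below. The name `gini_index` and the label lists are hypothetical; the case counts are illustrative and are not taken from Figure 6.27. The class proportions at the node are used as estimates of P(j | t).

```python
from collections import Counter


def gini_index(labels):
    """Gini index of diversity of a node: sum over j != k of P(j|t) * P(k|t)."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node must contain at least one case")
    # Estimate P(j | t) by the fraction of cases of class j at the node.
    props = [count / n for count in Counter(labels).values()]
    return sum(pj * pk
               for j, pj in enumerate(props)
               for k, pk in enumerate(props)
               if j != k)


if __name__ == "__main__":
    print(gini_index(["white", "black", "white", "black"]))   # equal proportions
    print(gini_index(["white", "white", "white", "black"]))   # one class dominates
    print(gini_index(["black", "black", "black"]))            # pure node
```

With two classes, the value is largest (0.5) for equal proportions and drops to 0 for a pure node, in line with the description of maximal and zero impurity above.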
           For the situation shown in Figure 6.27, we have: