Page 283 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

264      6 Statistical Classification


              $i(t_1) = i(t_2) = 1 \times 1 = 1$;
              $i(t_{11}) = i(t_{12}) = \frac{2}{3} \times \frac{1}{3} = \frac{2}{9}$;
              $i(t_{21}) = i(t_{22}) = 1 \times 0 = 0$.

              In the automatic generation of binary trees, the tree starts at the root node,
           which corresponds to the whole training set. It then grows by searching, at each
           node and for each variable, for the threshold level that achieves the maximum
           decrease of the impurity. The generation of splits stops when no significant
           decrease of the impurity is achieved. It is common practice to use the individual
           feature values of the training set cases as candidate threshold values. After a
           tree has been generated automatically, some form of tree pruning is sometimes
           needed in order to remove branches of no interest.
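
           The threshold search described above can be sketched for a single variable as
           follows. This is a minimal illustration, not the SPSS or STATISTICA
           implementation: it assumes the common form of the Gini impurity, 1 minus the sum
           of squared class proportions, and the names gini and best_split as well as the
           toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity, assumed here as 1 - sum_k p_k^2 over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, impurity_decrease) for one feature.

    Candidate thresholds are the observed feature values. A case goes to the
    left child when its value is <= threshold, otherwise to the right child.
    """
    n = len(labels)
    parent_impurity = gini(labels)
    best_threshold, best_decrease = None, 0.0
    # The largest value is skipped: it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Impurity of the split: child impurities weighted by child size.
        child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        decrease = parent_impurity - child_impurity
        if decrease > best_decrease:
            best_threshold, best_decrease = t, decrease
    return best_threshold, best_decrease


# Hypothetical toy data: one feature, two classes.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = ["a", "a", "a", "b", "b", "b"]
threshold, decrease = best_split(x, y)
print(threshold, decrease)
```

           A full tree builder would repeat this search over all variables at each node,
           split on the best one, and stop when the decrease falls below a chosen
           significance level.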
              SPSS and STATISTICA have specific commands for designing tree classifiers,
           based on univariate splits. The method of exhaustive search for the best univariate
           splits is usually called the CRT (also CART or C&RT) method, pioneered by
           Breiman, Friedman, Olshen and Stone (see Breiman et al., 1993).


           Example 6.17
           Q: Use the CRT approach with univariate splits and the Gini index as splitting
           criterion in order to derive a decision tree for the Breast Tissue dataset.
           Assume equal priors of the classes.
           A: Applying the commands for CRT univariate split with the Gini index, described
           in Commands 6.3, the tree presented in Figure 6.28 was found with SPSS (same
           solution with STATISTICA). The tree shows the split thresholds at each node as
           well as the improvement achieved in the Gini index. For instance, the first split
           variable PERIM was selected with a threshold level of 1563.84.


           Table 6.13. Training set classification matrix, obtained with SPSS, corresponding
           to the tree shown in Figure 6.28.
           Observed                      Predicted                      Percent
                       car    fad    mas    gla    con    adi           Correct
           car          20      0      1      0      0      0            95.2%
           fad           0      0     12      3      0      0             0.0%
           mas           2      0     15      1      0      0            83.3%
           gla           1      0      4     11      0      0            68.8%
           con           0      0      0      0     14      0           100.0%
           adi           0      0      0      0      1     21            95.5%
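
           The "Percent Correct" column can be checked directly from the counts in Table
           6.13. The short sketch below recomputes it as the diagonal count divided by the
           row total; the variable and function names are illustrative only, and the overall
           figure is the plain proportion of correctly classified training cases, not a
           prior-weighted estimate.

```python
# Counts copied from Table 6.13 (rows: observed class, columns: predicted class).
classes = ["car", "fad", "mas", "gla", "con", "adi"]
matrix = [
    [20, 0, 1, 0, 0, 0],
    [0, 0, 12, 3, 0, 0],
    [2, 0, 15, 1, 0, 0],
    [1, 0, 4, 11, 0, 0],
    [0, 0, 0, 0, 14, 0],
    [0, 0, 0, 0, 1, 21],
]


def percent_correct(matrix):
    """Per-class percent correct: diagonal entry over the observed (row) total."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


per_class = percent_correct(matrix)
total_cases = sum(sum(row) for row in matrix)
correct_cases = sum(matrix[i][i] for i in range(len(matrix)))
overall = 100.0 * correct_cases / total_cases

for name, pc in zip(classes, per_class):
    print(f"{name}: {pc:.1f}%")
print(f"overall: {overall:.1f}% ({correct_cases} of {total_cases} cases)")
```

           Note that none of the 15 fad cases is assigned to its own class: 12 are
           classified as mas and 3 as gla.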