
138    4 Statistical Classification


                              - Second level, right node: car vs. {mas, gla, fad}. Feature AREA was used, with
                                 the threshold AREA=1710.5. This split is achieved with four misclassified cases
                                 (5.7%).
                              - Third level node: gla vs. {mas, fad}. Feature DA was used, with the threshold
                                 DA=36.5. This split is achieved with eight misclassified cases (17%).
                                 Notice how the CART approach achieves a tree solution with a structure similar to
                              the one manually derived and shown in Figure 4.41. The classification
                              performance is somewhat better than previously obtained. Notice also the gradual
                              increase of the errors as one progresses through the tree. Node splitting stops when
                              no significant classification improvement is found, in this case on reaching the
                              {mas, fad} node, as expected.
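
A single-feature threshold split of the kind used at each node above can be sketched as follows. This is a minimal illustration, not the CART procedure that produced the tree: candidate thresholds are scored by the raw count of misclassified cases, whereas CART implementations typically score splits with an impurity measure. The function names and the data values are hypothetical, and the convention that values above the threshold are assigned to the named class is an assumption made for the example.

```python
def split_errors(values, labels, threshold, high_class):
    """Count cases misclassified by the rule: value > threshold -> high_class.

    Any label other than high_class is treated as the opposite side of the split.
    """
    errors = 0
    for value, label in zip(values, labels):
        predicted_high = value > threshold
        is_high = label == high_class
        if predicted_high != is_high:
            errors += 1
    return errors


def best_threshold(values, labels, high_class):
    """Search midpoints between consecutive distinct values.

    Returns (threshold, errors) for the candidate with the fewest errors;
    ties are resolved in favour of the smallest threshold.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, split_errors(values, labels, t, high_class)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])


# Synthetic feature values and labels, invented for illustration only.
area = [900.0, 1200.0, 1500.0, 1800.0, 2000.0, 2200.0, 2600.0]
labels = ["mas", "gla", "fad", "car", "car", "fad", "car"]

threshold, errors = best_threshold(area, labels, "car")
error_rate = errors / len(area)
```

On this toy data the search selects the midpoint between 1500 and 1800, leaving one misclassified case out of seven. The same error count divided by the number of cases reaching the node gives percentages of the kind quoted for each split.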


                              4.7  Statistical Classifiers in Data Mining


                              A current trend in database technology applied to large organizations (e.g.
                              enterprises, hospitals, credit card companies) involves the concept of data
                              warehousing, together with sophisticated data search techniques known collectively
                              as data mining. A data warehouse is a database system involving large tables whose
                              contents are periodically updated, containing a detailed history of the information
                              and supporting advanced data description and summarizing tools as well as metadata
                              facilities, i.e., data about the location and description of the system components
                              (e.g. names, definitions, structures).
                                 Data  mining  techniques  are  used  to  extract  relevant  information  from  data
                               warehouses and are applied in many diverse fields, such as:

                               - Engineering, e.g. equipment failure prediction, web search engines.
                               - Economics, e.g. prediction of investment revenue, detection of consumer
                                 profiles, assessment of loan risk.
                               - Biology and medicine, e.g. protein sequencing in genome mapping, assessment
                                 of pregnancy risk.
                                 These techniques  use  pattern  recognition  approaches  such  as  data  clustering,
                               statistical  classification  and  neural  networks,  as  well  as  artificial  intelligence
                               approaches such as knowledge representation, causality models and rule induction.
                                 We discuss here some issues concerning the application of statistical
                               classifiers to data mining. The application of neural networks is presented in
                               Section 5.14. Important aspects to consider in data mining applications of
                               pattern recognition techniques are:

                               - The  need  to  operate  with  large  databases,  in  an  on-line  decision  support
                                 environment,  therefore  imposing  strict  requirements  regarding  algorithmic
                                 performance, namely in terms of speed.