- Second level, right node: car vs. {mas, gla, fad}. Feature AREA was used, with
the threshold AREA=1710.5. This split is achieved with four misclassified cases
(5.7%).
- Third level node: gla vs. {mas, fad}. Feature DA was used, with the threshold
DA=36.5. This split is achieved with eight misclassified cases (17%).
Notice how the CART approach achieves a tree solution with a structure similar to
the one manually derived and shown in Figure 4.41. The classification
performance is somewhat better than previously obtained. Notice also the gradual
increase in errors as one progresses down the tree. Node splitting stops when
no further significant classification is found, in this case upon reaching the
{mas, fad} node, as expected.
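The single-feature threshold splits described above can be illustrated with a minimal sketch. It assumes a two-class node and scores each candidate threshold by its count of misclassified cases; this is a simplification, since CART implementations usually rank splits with an impurity measure. The function name best_threshold and the toy data are hypothetical and are not taken from the data set discussed in the text.

```python
from typing import List, Tuple


def best_threshold(values: List[float], labels: List[int]) -> Tuple[float, int]:
    """Return (threshold, misclassified) for the best split on one feature.

    Assumptions (illustrative only): labels are 0 or 1, candidate thresholds
    are midpoints between consecutive distinct feature values, and each side
    of the split is assigned its majority class.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), n + 1)  # sentinel: worse than any real split
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        # Errors on each side are the cases outside the majority class.
        errors = (min(left.count(0), left.count(1))
                  + min(right.count(0), right.count(1)))
        if errors < best[1]:
            best = (t, errors)
    return best


# Hypothetical toy node with six cases.
toy_values = [10, 20, 30, 40, 50, 60]
toy_labels = [0, 0, 1, 0, 1, 1]
threshold, errors = best_threshold(toy_values, toy_labels)
print(threshold, errors, f"{100 * errors / len(toy_values):.1f}%")
```

Taking midpoints between consecutive values is one common way to obtain half-integer thresholds such as those quoted above, although the text does not state how they were actually computed.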
4.7 Statistical Classifiers in Data Mining
A current trend in database technology applied to large organizations (e.g.
enterprises, hospitals, credit card companies) involves the concept of data
warehousing and sophisticated data search techniques known collectively as data
mining. A data warehouse is a database system involving large tables whose
contents are periodically updated, containing a detailed history of the information,
supporting advanced data description and summarizing tools as well as metadata
facilities, i.e., data about the location and description of the system components
(e.g. names, definitions, structures).
Data mining techniques are used to extract relevant information from data
warehouses and are applied in many diverse fields, such as:
- Engineering, e.g. equipment failure prediction, web search engines.
- Economy, e.g. prediction of investment revenue, detection of consumer
profiles, assessment of loan risk.
- Biology and medicine, e.g. protein sequencing in genome mapping, assessment
of pregnancy risk.
These techniques use pattern recognition approaches such as data clustering,
statistical classification and neural networks, as well as artificial intelligence
approaches such as knowledge representation, causality models and rule induction.
We will discuss here some issues concerning the application of statistical
classifiers to data mining. The application of neural networks is presented in
Section 5.14. Important aspects to consider in data mining applications of
pattern recognition techniques are:
- The need to operate with large databases, in an on-line decision support
environment, therefore imposing strict requirements regarding algorithmic
performance, namely in terms of speed.