Page 61 -
P. 61

HAN 08-ch01-001-038-9780123814791


          24    Chapter 1 Introduction                       2011/6/1  3:12  Page 24  #24



                         models of target classes can be built. In other words, such statistical models can be the
                         outcome of a data mining task. Alternatively, data mining tasks can be built on top of
                         statistical models. For example, we can use statistics to model noise and missing data
                         values. Then, when mining patterns in a large data set, the data mining process can use
                         the model to help identify and handle noisy or missing values in the data.
                           Statistics research develops tools for prediction and forecasting using data and sta-
                         tistical models. Statistical methods can be used to summarize or describe a collection
                         of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is
                         useful for mining various patterns from data as well as for understanding the underlying
                         mechanisms generating and affecting the patterns. Inferential statistics (or predictive
                         statistics) models data in a way that accounts for randomness and uncertainty in the
                         observations and is used to draw inferences about the process or population under
                         investigation.
                           Statistical methods can also be used to verify data mining results. For example, after
                         a classification or prediction model is mined, the model should be verified by statisti-
                         cal hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data
                         analysis) makes statistical decisions using experimental data. A result is called statistically
                         significant if it is unlikely to have occurred by chance. If the classification or prediction
                         model holds true, then the descriptive statistics of the model increases the soundness of
                         the model.
                           Applying statistical methods in data mining is far from trivial. Often, a serious chal-
                         lenge is how to scale up a statistical method over a large data set. Many statistical
                         methods have high complexity in computation. When such methods are applied on
                         large data sets that are also distributed on multiple logical or physical sites, algorithms
                         should be carefully designed and tuned to reduce the computational cost. This challenge
                         becomes even tougher for online applications, such as online query suggestions in
                         search engines, where data mining is required to continuously handle fast, real-time
                         data streams.


                   1.5.2 Machine Learning
                         Machine learning investigates how computers can learn (or improve their performance)
                         based on data. A main research area is for computer programs to automatically learn to
                         recognize complex patterns and make intelligent decisions based on data. For example, a
                         typical machine learning problem is to program a computer so that it can automatically
                         recognize handwritten postal codes on mail after learning from a set of examples.
                           Machine learning is a fast-growing discipline. Here, we illustrate classic problems in
                         machine learning that are highly related to data mining.

                           Supervised learning is basically a synonym for classification. The supervision in the
                           learning comes from the labeled examples in the training data set. For example, in
                           the postal code recognition problem, a set of handwritten postal code images and
                           their corresponding machine-readable translations are used as the training examples,
                           which supervise the learning of the classification model.
   56   57   58   59   60   61   62   63   64   65   66