Page 61 -

P. 61

HAN 08-ch01-001-038-9780123814791

24 Chapter 1 Introduction 2011/6/1 3:12 Page 24 #24

models of target classes can be built. In other words, such statistical models can be the
outcome of a data mining task. Alternatively, data mining tasks can be built on top of
statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use
the model to help identify and handle noisy or missing values in the data.
Statistics research develops tools for prediction and forecasting using data and sta-
tistical models. Statistical methods can be used to summarize or describe a collection
of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is
useful for mining various patterns from data as well as for understanding the underlying
mechanisms generating and affecting the patterns. Inferential statistics (or predictive
statistics) models data in a way that accounts for randomness and uncertainty in the
observations and is used to draw inferences about the process or population under
investigation.
Statistical methods can also be used to verify data mining results. For example, after
a classiﬁcation or prediction model is mined, the model should be veriﬁed by statisti-
cal hypothesis testing. A statistical hypothesis test (sometimes called conﬁrmatory data
analysis) makes statistical decisions using experimental data. A result is called statistically
signiﬁcant if it is unlikely to have occurred by chance. If the classiﬁcation or prediction
model holds true, then the descriptive statistics of the model increases the soundness of
the model.
Applying statistical methods in data mining is far from trivial. Often, a serious chal-
lenge is how to scale up a statistical method over a large data set. Many statistical
methods have high complexity in computation. When such methods are applied on
large data sets that are also distributed on multiple logical or physical sites, algorithms
should be carefully designed and tuned to reduce the computational cost. This challenge
becomes even tougher for online applications, such as online query suggestions in
search engines, where data mining is required to continuously handle fast, real-time
data streams.

1.5.2 Machine Learning
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data. For example, a
typical machine learning problem is to program a computer so that it can automatically
recognize handwritten postal codes on mail after learning from a set of examples.
Machine learning is a fast-growing discipline. Here, we illustrate classic problems in
machine learning that are highly related to data mining.

Supervised learning is basically a synonym for classiﬁcation. The supervision in the
learning comes from the labeled examples in the training data set. For example, in
the postal code recognition problem, a set of handwritten postal code images and
their corresponding machine-readable translations are used as the training examples,
which supervise the learning of the classiﬁcation model.

56 57 58 59 60 61 62 63 64 65 66