Page 127 -

P. 127

10-ch03-083-124-9780123814791
HAN

90 Chapter 3 Data Preprocessing 2011/6/1 3:16 Page 90 #8

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Figure 3.2 Binning methods for data smoothing.

greater the effect of the smoothing. Alternatively, bins may be equal width, where the
interval range of values in each bin is constant. Binning is also used as a discretization
technique and is further discussed in Section 3.5.
Regression: Data smoothing can also be done by regression, a technique that con-
forms data values to a function. Linear regression involves ﬁnding the “best” line to
ﬁt two attributes (or variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are ﬁt to a multidimensional surface. Regression
is further described in Section 3.4.5.
Outlier analysis: Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that fall outside of
the set of clusters may be considered outliers (Figure 3.3). Chapter 12 is dedicated to
the topic of outlier analysis.
Many data smoothing methods are also used for data discretization (a form of data
transformation) and data reduction. For example, the binning techniques described
before reduce the number of distinct values per attribute. This acts as a form of data
reduction for logic-based data mining methods, such as decision tree induction, which
repeatedly makes value comparisons on sorted data. Concept hierarchies are a form of
data discretization that can also be used for data smoothing. A concept hierarchy for
price, for example, may map real price values into inexpensive, moderately priced, and
expensive, thereby reducing the number of data values to be handled by the mining

122 123 124 125 126 127 128 129 130 131 132