          108   Chapter 3 Data Preprocessing

                           Histograms are highly effective at approximating both sparse and dense data, as
                         well as highly skewed and uniform data. The histograms described before for single
                         attributes can be extended for multiple attributes. Multidimensional histograms can cap-
                         ture dependencies between attributes. These histograms have been found effective in
                         approximating data with up to five attributes. More studies are needed regarding the
                         effectiveness of multidimensional histograms for high dimensionalities.
                           Singleton buckets are useful for storing high-frequency outliers.
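As a minimal sketch of these ideas (the function and its parameters are illustrative, not from the text), a single-attribute equi-width histogram can approximate a column of values, with singleton buckets reserved for high-frequency outliers:

```python
from collections import Counter

def equiwidth_histogram(values, n_buckets, outlier_min_count=None):
    """Approximate a numeric attribute by an equi-width histogram.

    Values occurring at least `outlier_min_count` times are kept exactly,
    in singleton buckets {value: count}; all remaining values fall into
    n_buckets equal-width range buckets [(lo, hi), count].
    (Illustrative sketch; names and thresholds are assumptions.)
    """
    counts = Counter(values)
    singletons = {}
    if outlier_min_count is not None:
        singletons = {v: c for v, c in counts.items()
                      if c >= outlier_min_count}
    rest = [v for v in values if v not in singletons]
    buckets = []
    if rest:
        lo, hi = min(rest), max(rest)
        width = (hi - lo) / n_buckets or 1  # guard against zero width
        buckets = [[(lo + i * width, lo + (i + 1) * width), 0]
                   for i in range(n_buckets)]
        for v in rest:
            i = min(int((v - lo) / width), n_buckets - 1)
            buckets[i][1] += 1
    return buckets, singletons

# Toy data: the value 100 is a frequent outlier, so it gets its own bucket.
prices = [1, 1, 5, 5, 5, 8, 10, 14, 14, 15, 18, 21, 100, 100, 100, 100]
buckets, singles = equiwidth_histogram(prices, n_buckets=3,
                                       outlier_min_count=4)
```

Only the bucket boundaries and counts need to be stored, which is the source of the data reduction; the raw values are discarded.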


                   3.4.7 Clustering
                         Clustering techniques consider data tuples as objects. They partition the objects into
                         groups, or clusters, so that objects within a cluster are “similar” to one another and “dis-
                         similar” to objects in other clusters. Similarity is commonly defined in terms of how
                         “close” the objects are in space, based on a distance function. The “quality” of a cluster
                         may be represented by its diameter, the maximum distance between any two objects in
                         the cluster. Centroid distance is an alternative measure of cluster quality and is defined
                         as the average distance of each cluster object from the cluster centroid (denoting the
                         “average object,” or average point in space for the cluster). Figure 3.3 showed a 2-D plot
                         of customer data with respect to customer locations in a city. Three data clusters are
                         visible.
                           In data reduction, the cluster representations of the data are used to replace the actual
                         data. The effectiveness of this technique depends on the data’s nature. It is much more
                         effective for data that can be organized into distinct clusters than for smeared data.
                           There are many measures for defining clusters and cluster quality. Clustering meth-
                         ods are further described in Chapters 10 and 11.


                   3.4.8 Sampling
                         Sampling can be used as a data reduction technique because it allows a large data set to
                         be represented by a much smaller random data sample (or subset). Suppose that a large
                         data set, D, contains N tuples. Let’s look at the most common ways that we could sample
                         D for data reduction, as illustrated in Figure 3.9.
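The first two of these schemes, simple random sampling with and without replacement, can be sketched using Python's standard random module (the function names are illustrative):

```python
import random

def srswor(D, s, seed=None):
    """Simple random sample without replacement: s distinct tuples
    drawn from D, each tuple equally likely to be chosen (s < N)."""
    return random.Random(seed).sample(D, s)

def srswr(D, s, seed=None):
    """Simple random sample with replacement: each drawn tuple is
    placed back in D, so the same tuple may be drawn again."""
    rng = random.Random(seed)
    return [rng.choice(D) for _ in range(s)]

# Toy data set of N = 100 tuples (illustrative).
D = list(range(100))
sample = srswor(D, 10)
```

Note that `srswr` may return duplicates, while `srswor` never does; both leave every tuple in D equally likely on each draw.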

                           Simple random sample without replacement (SRSWOR) of size s: This is created
                           by drawing s of the N tuples from D (s < N), where the probability of drawing any
                           tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
                           Simple random sample with replacement (SRSWR) of size s: This is similar to
                           SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
                           replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
                           again.
                           Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,”
                           then an SRS of s clusters can be obtained, where s < M. For example, tuples in a
                           database are usually retrieved a page at a time, so that each page can be considered