                                                            1.4 What Kinds of Patterns Can Be Mined?  21


                               events can be more interesting than the more regularly occurring ones. The analysis of
                               outlier data is referred to as outlier analysis or anomaly mining.
                                 Outliers may be detected using statistical tests that assume a distribution or proba-
                               bility model for the data, or using distance measures where objects that are remote from
                               any other cluster are considered outliers. Instead of relying on statistical or distance
                               measures, density-based methods may identify objects that are outliers within a local
                               region even though they look normal from a global statistical distribution view.
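The distance-based idea can be illustrated with a small sketch. It is only one possible reading of "remote from any other cluster": each object is scored by the distance to its k-th nearest neighbor, and the object with the largest score is treated as the outlier candidate. The function name `knn_distance_scores`, the choice k = 2, and the sample points are assumptions made for this illustration, not part of the text.

```python
import math

def knn_distance_scores(points, k=2):
    """Return, for each point, the distance to its k-th nearest neighbor.

    A large score means the point is far from its closest neighbors,
    which is one simple (assumed) way to quantify "remoteness".
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: four points in a tight group and one remote point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)

# The point with the largest k-NN distance is the outlier candidate.
outlier_index = max(range(len(points)), key=scores.__getitem__)
print(points[outlier_index], round(scores[outlier_index], 2))
```

A score like this is global in nature, since every point is judged on the same distance scale. Density-based methods instead compare each point with the density of its own neighborhood.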

                 Example 1.10 Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by
                               detecting purchases of unusually large amounts for a given account number in compari-
                               son to regular charges incurred by the same account. Outlier values may also be detected
                               with respect to the locations and types of purchase, or the purchase frequency.
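As a rough sketch of the credit card scenario, the following code flags new charges that are unusually large compared with the regular charges on the same account. It assumes a simple statistical test (a z-score against the account's own history) and a threshold of 3 standard deviations. The function name `flag_unusual_charges`, the threshold, and the amounts are all hypothetical.

```python
from statistics import mean, stdev

def flag_unusual_charges(history, new_charges, z_threshold=3.0):
    """Return the new charges lying more than z_threshold sample standard
    deviations above the mean of the account's past charges.

    history must contain at least two past charges.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past charges are identical; treat any larger charge as unusual.
        return [c for c in new_charges if c > mu]
    return [c for c in new_charges if (c - mu) / sigma > z_threshold]

# Hypothetical regular charges for one account, followed by two new charges.
history = [42.0, 18.5, 63.2, 25.0, 51.3, 37.8, 29.9, 44.1]
flagged = flag_unusual_charges(history, [48.0, 2500.0])
print(flagged)
```

Only past charges are used to estimate the mean and standard deviation, so a single extreme purchase cannot inflate the statistics it is tested against. The same pattern could be applied to purchase frequency. Locations and types of purchase are categorical and would need a different test.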


                               Outlier analysis is discussed in Chapter 12.

                         1.4.6 Are All Patterns Interesting?
                               A data mining system has the potential to generate thousands or even millions of
                               patterns, or rules.
                                 You may ask, “Are all of the patterns interesting?” Typically, the answer is no—only
                               a small fraction of the patterns potentially generated would actually be of interest to a
                               given user.
                                 This raises some serious questions for data mining. You may wonder, “What makes a
                               pattern interesting? Can a data mining system generate all of the interesting patterns? Or,
                               can the system generate only the interesting ones?”
                                 To answer the first question, a pattern is interesting if it is (1) easily understood by
                               humans, (2) valid on new or test data with some degree of certainty, (3) potentially
                               useful, and (4) novel. A pattern is also interesting if it validates a hypothesis that the user
                               sought to confirm. An interesting pattern represents knowledge.
                                 Several objective measures of pattern interestingness exist. These are based on
                               the structure of discovered patterns and the statistics underlying them. An objective
                               measure for association rules of the form X ⇒ Y is rule support, representing the per-
                               centage of transactions from a transaction database that the given rule satisfies. This is
                               taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains
                               both X and Y, that is, the union of itemsets X and Y. Another objective measure for
                               association rules is confidence, which assesses the degree of certainty of the detected
                               association. This is taken to be the conditional probability P(Y|X), that is, the prob-
                               ability that a transaction containing X also contains Y. More formally, support and
                               confidence are defined as
                                                      support(X ⇒ Y) = P(X ∪ Y),
                                                     confidence(X ⇒ Y) = P(Y|X).
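The two definitions can be checked on a toy transaction database. The sketch below estimates P(X ∪ Y) as the fraction of transactions that contain every item of both X and Y, and P(Y|X) as that count divided by the number of transactions containing X. The function names and the five transactions are illustrative assumptions.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, x, y):
    """support(X => Y): fraction of transactions containing both X and Y."""
    return count_containing(transactions, set(x) | set(y)) / len(transactions)

def confidence(transactions, x, y):
    """confidence(X => Y): among transactions containing X, the fraction
    that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        raise ValueError("no transaction contains X; confidence is undefined")
    return count_containing(transactions, set(x) | set(y)) / n_x

# Hypothetical transaction database with five transactions.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer"},
]

rule_support = support(transactions, {"computer"}, {"software"})        # 2 of 5
rule_confidence = confidence(transactions, {"computer"}, {"software"})  # 2 of 4
print(rule_support, rule_confidence)
```

Here the rule computer ⇒ software holds in 2 of the 5 transactions (support 0.4), and 2 of the 4 transactions that contain a computer also contain software (confidence 0.5).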

                               In general, each interestingness measure is associated with a threshold, which may be
                               controlled by the user. For example, rules that do not satisfy a confidence threshold of,