

            Table 3.4 A fragment of a data set used for sequence mining: each line corresponds to an order
            Customer      Seq. number     Timestamp          Items

            Wil           1               02-01-2011:09.02   {cappuccino}
                          2               03-01-2011:10.06   {espresso,muffin}
                          3               05-01-2011:15.12   {americano,cappuccino}
                          4               06-01-2011:11.18   {espresso,muffin}
                          5               07-01-2011:14.24   {cappuccino}
                          6               07-01-2011:14.24   {americano,cappuccino}
            Mary          1               30-12-2010:11.32   {tea}
                          2               30-12-2010:12.12   {cappuccino}
                          3               30-12-2010:14.16   {espresso,muffin}
                          4               05-01-2011:11.22   {bagel,tea}
            Bill          1               30-12-2010:14.32   {cappuccino}
                          2               30-12-2010:15.06   {cappuccino}
                          3               30-12-2010:16.34   {bagel,espresso,muffin}
                          4               06-01-2011:09.18   {ristretto}
                          5               06-01-2011:12.18   {cappuccino}
            ...           ...             ...                ...
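
To make this representation concrete, the fragment above can be encoded as a mapping from each customer to a chronologically ordered list of timestamped item-sets. The following Python sketch mirrors Table 3.4; the variable name orders and the encoding itself are illustrative and not taken from any particular sequence-mining tool:

from datetime import datetime

# Each customer maps to a chronologically ordered list of
# (timestamp, item-set) pairs, mirroring Table 3.4.
orders = {
    "Wil": [
        (datetime(2011, 1, 2, 9, 2),   {"cappuccino"}),
        (datetime(2011, 1, 3, 10, 6),  {"espresso", "muffin"}),
        (datetime(2011, 1, 5, 15, 12), {"americano", "cappuccino"}),
        (datetime(2011, 1, 6, 11, 18), {"espresso", "muffin"}),
        (datetime(2011, 1, 7, 14, 24), {"cappuccino"}),
        (datetime(2011, 1, 7, 14, 24), {"americano", "cappuccino"}),
    ],
    "Mary": [
        (datetime(2010, 12, 30, 11, 32), {"tea"}),
        (datetime(2010, 12, 30, 12, 12), {"cappuccino"}),
        (datetime(2010, 12, 30, 14, 16), {"espresso", "muffin"}),
        (datetime(2011, 1, 5, 11, 22),   {"bagel", "tea"}),
    ],
    "Bill": [
        (datetime(2010, 12, 30, 14, 32), {"cappuccino"}),
        (datetime(2010, 12, 30, 15, 6),  {"cappuccino"}),
        (datetime(2010, 12, 30, 16, 34), {"bagel", "espresso", "muffin"}),
        (datetime(2011, 1, 6, 9, 18),    {"ristretto"}),
        (datetime(2011, 1, 6, 12, 18),   {"cappuccino"}),
    ],
}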



  In principle, there is an infinite number of potential patterns. However, just like in the Apriori algorithm, a monotonicity property can be exploited: if a sequence is frequent, then its subsequences are also frequent. Therefore, frequent patterns can be generated efficiently: only sequences whose subsequences are all frequent need to be considered as candidates. Frequent sequences can also be used to create rules of the form X ⇒ Y where X is a pattern and Y is an extension or continuation of the pattern. Consider, for example, X = ⟨{cappuccino},{espresso}⟩ and Y = ⟨{cappuccino},{espresso},{latte,muffin}⟩. Suppose that X has a support of 0.05 and Y has a support of 0.04. Then the confidence of X ⇒ Y is 0.04/0.05 = 0.8, i.e., 80% of the customers that ordered a cappuccino followed by an espresso later also ordered a latte and a muffin.
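
A minimal sketch of these notions, assuming the orders dictionary from the sketch after Table 3.4 (the function names are illustrative): a pattern matches a customer's sequence if its item-sets are contained, in order, in the customer's orders; the support of a pattern is the fraction of customers it matches; and the confidence of X ⇒ Y is support(Y)/support(X).

def matches(pattern, sequence):
    # A pattern (a list of item-sets) occurs in a sequence of
    # (timestamp, item-set) pairs if each pattern item-set is a
    # subset of some order, respecting the order of the pattern.
    i = 0
    for _, items in sequence:
        if i < len(pattern) and pattern[i] <= items:
            i += 1
    return i == len(pattern)

def support(pattern, orders):
    # Fraction of customers whose order sequence contains the pattern.
    return sum(matches(pattern, seq) for seq in orders.values()) / len(orders)

X = [{"cappuccino"}, {"espresso"}]
Y = [{"cappuccino"}, {"espresso"}, {"latte", "muffin"}]
# On the three-customer fragment, support(X) = 1.0 and support(Y) = 0.0
# (no latte occurs in the fragment); the supports 0.05 and 0.04 in the
# text refer to a larger data set.
confidence = support(Y, orders) / support(X, orders)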
              In [89], several extensions of the above approach have been proposed. For ex-
            ample, it is possible to add taxonomies, sliding windows, and time constraints. For
            practical applications, it is important to relax the strict subsequence requirement
            such that a one-to-one matching of item-sets is no longer needed.
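
For instance, a maximal time gap between consecutive item-sets can be imposed on the matching. The sketch below, again assuming the orders dictionary from the earlier sketch, only illustrates this idea; [89] defines sliding windows and time constraints precisely, and the name matches_within is illustrative. A match is accepted only if consecutively matched orders are at most max_gap apart:

from datetime import timedelta

def matches_within(pattern, sequence, max_gap):
    # Backtracking search for an occurrence of the pattern in which
    # consecutively matched orders are at most max_gap apart.
    def search(i, prev_time):
        if i == len(pattern):
            return True
        for t, items in sequence:
            if prev_time is not None and not (prev_time < t <= prev_time + max_gap):
                continue
            if pattern[i] <= items and search(i + 1, t):
                return True
        return False
    return search(0, None)

# Mary's cappuccino and espresso orders are about two hours apart:
matches_within([{"cappuccino"}, {"espresso"}], orders["Mary"],
               timedelta(days=2))  # True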




            3.5.2 Episode Mining


Another problem that can be solved using an Apriori-like approach is the discovery
of frequent episodes [63]. Here a sliding window is used to analyze how frequently
an episode occurs.
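
To give a flavor of this frequency measure, the following sketch counts, for a serial episode (a sequence of event types; in [63] episodes may in general be partial orders), the fraction of windows of a given width that contain it. The event sequence and names below are illustrative:

def episode_frequency(episode, events, width):
    # events: time-ordered list of (integer time, event type) pairs.
    # A window [s, s + width) is slid over the sequence; the frequency
    # is the fraction of windows containing the episode, i.e., in which
    # its event types occur in the given order.
    times = [t for t, _ in events]
    starts = range(min(times) - width + 1, max(times) + 1)
    hits = 0
    for s in starts:
        i = 0
        for t, e in events:
            if s <= t < s + width and i < len(episode) and e == episode[i]:
                i += 1
        hits += (i == len(episode))
    return hits / len(starts)

events = [(1, "a"), (2, "b"), (4, "a"), (6, "b")]
episode_frequency(["a", "b"], events, width=3)  # 3 of 8 windows: 0.375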