
              Association rules are related to process discovery. Recall that the α-algorithm
            also traverses the event log looking for patterns. However, association rules do not
            consider the ordering of activities and do not aim to build an overall process model.




            3.5 Sequence and Episode Mining

The Apriori algorithm uses the monotonicity property that all subsets of a frequent item-set are also frequent. Many other pattern or rule discovery problems have similar monotonicity properties, thus enabling efficient implementations. A well-known example is the mining of sequential patterns. After introducing sequence mining, we also describe an approach to discover frequent episodes and mention some other data mining techniques relevant for process mining.



            3.5.1 Sequence Mining


The Apriori algorithm does not consider the ordering of events. Sequence mining overcomes this problem by analyzing sequences of item-sets. One of the early approaches was developed by Srikant and Agrawal [89]. Here, we sketch the essence of this approach. To explain the problem addressed by sequence mining, we consider the data set shown in Table 3.4. Each line corresponds to a customer ordering a set of items, e.g., at 9.02 on January 2nd 2011, Wil ordered a cappuccino; one day later he ordered an espresso and a muffin. Per customer there is a sequence of orders. Orders have a sequence number, a timestamp, and an item-set. A more compact representation of the first customer sequence is ⟨{cappuccino}, {espresso, muffin}, {americano, cappuccino}, {espresso, muffin}, {cappuccino}, {americano, cappuccino}⟩. The goal is to find frequent sequences defined by a pattern like ⟨{cappuccino}, {espresso, muffin}, {cappuccino}⟩. A sequence is frequent if the pattern is contained in a predefined proportion of the customer sequences in the data set.
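To make this representation concrete, the sketch below encodes the first customer sequence, quoted above, as a Python list of item-sets. It is only a minimal illustration; the variable names are ours, and the remaining customer sequences in Table 3.4 would be encoded in the same way.

    # Wil's customer sequence from Table 3.4: a list of orders, each order an item-set.
    # frozenset is used because the order of items within a single order is irrelevant.
    wil = [
        frozenset({"cappuccino"}),
        frozenset({"espresso", "muffin"}),
        frozenset({"americano", "cappuccino"}),
        frozenset({"espresso", "muffin"}),
        frozenset({"cappuccino"}),
        frozenset({"americano", "cappuccino"}),
    ]

    # A candidate pattern is represented in exactly the same way.
    pattern = [
        frozenset({"cappuccino"}),
        frozenset({"espresso", "muffin"}),
        frozenset({"cappuccino"}),
    ]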
  A sequence ⟨a_1, a_2, ..., a_n⟩ is a subsequence of another sequence ⟨b_1, b_2, ..., b_m⟩ if there exist integers i_1 < i_2 < ··· < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., a_n ⊆ b_{i_n}. For example, the sequence ⟨{x}, {x, y}, {y}⟩ is a subsequence of ⟨{z}, {x}, {z}, {x, y, z}, {y, z}, {z}⟩ because {x} ⊆ {x}, {x, y} ⊆ {x, y, z}, and {y} ⊆ {y, z}. However, ⟨{x}, {y}⟩ is not a subsequence of ⟨{x, y}⟩ and vice versa. The support of a sequence s is the fraction of sequences in the data set that have s as a subsequence. A sequence is frequent if its support meets some threshold minsup. Consider, for example, the data set consisting of just the three customer sequences visible in Table 3.4. Pattern ⟨{tea}, {bagel, tea}⟩ has a support of 1/3 as it is only a subsequence of Mary's sequence. Pattern ⟨{espresso}, {cappuccino}⟩ has a support of 2/3 as it is a subsequence of both Wil's and Bill's sequences, but not a subsequence of Mary's sequence. Pattern ⟨{cappuccino}, {espresso, muffin}⟩ has a support of 3/3 = 1.
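These definitions translate directly into a subsequence test and a support computation. The Python sketch below merely illustrates the definitions, it is not the mining algorithm of Srikant and Agrawal [89]; it assumes that customer sequences and patterns are encoded as lists of frozensets, as in the earlier sketch.

    def is_subsequence(pattern, sequence):
        # Greedy left-to-right matching: look for indices i_1 < i_2 < ... < i_n
        # such that each pattern item-set a_k is a subset of sequence element b_{i_k}.
        pos = 0
        for item_set in pattern:
            while pos < len(sequence) and not item_set <= sequence[pos]:
                pos += 1
            if pos == len(sequence):
                return False   # no remaining element can absorb this item-set
            pos += 1           # the next item-set must match a strictly later element
        return True

    def support(pattern, customer_sequences):
        # Fraction of customer sequences containing the pattern as a subsequence.
        hits = sum(is_subsequence(pattern, s) for s in customer_sequences)
        return hits / len(customer_sequences)

For Wil's sequence as encoded above, is_subsequence(pattern, wil) returns True. With Mary's and Bill's sequences from Table 3.4 added to the list of customer sequences, support would yield the fractions 1/3, 2/3, and 1 used in the examples above; a pattern is then reported as frequent whenever this fraction meets the threshold minsup.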