Page 95 -
P. 95
3.5 Sequence and Episode Mining 77
Association rules are related to process discovery. Recall that the α-algorithm
also traverses the event log looking for patterns. However, association rules do not
consider the ordering of activities and do not aim to build an overall process model.
3.5 Sequence and Episode Mining
The Apriori algorithm uses the monotonicity property that all subsets of a frequent
item-set are also frequent. Many other pattern or rule discovery problems have simi-
lar monotonicity properties, thus enabling efficient implementations. A well-known
example is the mining of sequential patterns. After introducing sequence mining,
we also describe an approach to discover frequent episodes and mention some other
data mining techniques relevant for process mining.
3.5.1 Sequence Mining
The Apriori algorithm does not consider the ordering of events. Sequence min-
ing overcomes this problem by analyzing sequences of item-sets. One of the
early approaches was developed by Srikant and Agrawal [89]. Here, we sketch
the essence of this approach. To explain the problem addressed by sequence
mining, we consider the data set shown in Table 3.4. Each line corresponds
to a customer ordering a set of items, e.g., at 9.02 on January 2nd 2011,
Wil ordered a cappuccino, one day later he orders an espresso and a muf-
fin. Per customer there is a sequence of orders. Orders have a sequence num-
ber, a timestamp, and an item-set. A more compact representation of the first
customer sequence is {cappuccino},{espresso,muffin},{americano,cappuccino},
{espresso,muffin},{cappuccino},{americano,cappuccino}
. The goal is to find
frequent sequences defined by a pattern like {cappuccino},{espresso,muffin},
{cappuccino}
. A sequence is frequent if the pattern is contained in a predefined
proportion of the customer sequences in the data set.
A sequence a 1 ,a 2 ,...,a n
is a subsequence of another sequence b 1 ,b 2 ,...,
,...,a n ⊆
b m
if there exist integers i 1 <i 2 < ··· <i n such that a 1 ⊆ b i 1 , a 2 ⊆ b i 2
. For example, the sequence {x},{x,y},{y}
is a subsequence of {z},{x},{z},
b i n
{x,y,z},{y,z},{z}
because {x}⊆{x}, {x,y}⊆{x,y,z}, and {y}⊆{y,z}.How-
ever, {x},{y}
is not a subsequence of {x,y}
and vice versa. The support of a
sequence s is the fraction of sequences in the data set that has s as a subsequence.
A sequence is frequent if its support meets some threshold minsup. Consider, for
example, the data sets consisting of just the three visible customer sequences in Ta-
ble 3.4. Pattern {tea},{bagel,tea}
has a support of 1/3 as it is only a subsequence
of Mary’s sequence. Pattern {espresso},{cappuccino}
has a support of 2/3as it is a
subsequence of both Wil’s and Bill’s subsequences, but not a subsequence of Mary’s
sequence. Pattern {cappuccino},{espresso,muffin}
has a support of 3/3 = 1.