There are two advantages associated with k-fold cross-validation. First of all, all
data is used both as training data and as test data. Second, if desired, one obtains k
tests of the desired performance indicator rather than just one. Formally, these tests
cannot be considered independent, as the training sets used in the k folds overlap
considerably. Nevertheless, the k folds make it possible to gain more insight into the
reliability of the estimated performance.
An extreme variant of k-fold cross-validation is "leave-one-out" cross-validation,
also known as jack-knifing. Here k = N, i.e., each test set contains exactly one
element. See [5, 67] for more information on the various forms of cross-validation.
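To make this concrete, the following sketch runs both variants using the scikit-learn library; the library choice, the decision tree classifier, and the Iris dataset are illustrative assumptions, not part of the discussion above.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # illustrative dataset
clf = DecisionTreeClassifier(random_state=0)

# k-fold cross-validation with k = 10: every instance is used for
# training in nine folds and for testing in exactly one fold
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=kfold)
print("10 accuracy estimates:", scores)

# leave-one-out ("jack-knifing"): k = N, one test instance per fold
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())

The k individual scores give an impression of the variability of the estimate, in line with the remark above that the folds provide insight into reliability.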
3.6.3 Occam’s Razor
Evaluating the quality of data mining results is far from trivial. In this subsection,
we discuss some additional complications that are also relevant for process mining.
Learning is typically an "ill-posed problem", i.e., only examples are given. Some
examples may rule out certain solutions; however, typically many possible models
remain. Moreover, there is typically a bias in both the target representation and the
learning algorithm. Consider, for example, the sequence 2,3,5,7,11,.... What is
the next element in this sequence? Most readers will guess that it is 13, i.e., the next
prime number, but there are infinitely many sequences that start with 2,3,5,7,11.
Yet, there seems to be a preference for some hypotheses over others. The term
inductive bias refers to a preference for one solution rather than another that cannot
be determined by the data itself, but is driven by external factors.
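The claim that infinitely many continuations exist can be made tangible: a polynomial of degree four passes exactly through the five observed values and then predicts a sixth element different from 13. A minimal sketch using NumPy, where encoding the observations at positions 1 to 5 is our own illustrative choice:

import numpy as np

x = np.arange(1, 6)             # positions of the five observations
y = np.array([2, 3, 5, 7, 11])  # the given sequence

# a degree-4 polynomial passes exactly through all five points ...
coeffs = np.polyfit(x, y, deg=4)

# ... yet it continues the sequence with 22 rather than 13
print(np.polyval(coeffs, 6))    # 22.0 (up to rounding)

By the same construction, a suitable degree-five polynomial realizes any desired sixth element, so the data alone cannot single out 13; the preference for "the next prime number" is an inductive bias.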
A representational bias refers to choices that are implicitly made by selecting a
particular representation. For example, in Sect. 3.2, we assumed that in a decision
tree the same attribute may appear only once on a path. This representational bias
rules out certain solutions, e.g., a decision tree in which a numerical attribute is
used in a coarse-grained manner close to the root and in a fine-grained manner in
some of the subtrees. Linear regression also makes assumptions about the function
used to best fit the data. The function is assumed to be linear, even though other,
non-linear functions may fit the data much better. Note that a representational bias
is not necessarily bad, e.g., linear regression has been used successfully in many
application domains. However, it is important to realize that the search space is
limited by the representation used. These limitations can guide the search process,
but may also exclude good solutions.
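The linear regression example can be illustrated with a small sketch; the quadratic ground truth and the noise level below are arbitrary choices made for the illustration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x ** 2 + rng.normal(0, 5, size=x.size)   # truly quadratic relation

# restricting the hypothesis space to linear functions y = a*x + b
a, b = np.polyfit(x, y, deg=1)
mse_linear = np.mean((y - (a * x + b)) ** 2)

# allowing quadratic functions fits the same data far better
c = np.polyfit(x, y, deg=2)
mse_quadratic = np.mean((y - np.polyval(c, x)) ** 2)

print(mse_linear, mse_quadratic)   # the linear model's error is much larger

The linear model is not "wrong"; it is simply unable to express the better solution, which is exactly what a representational bias entails.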
A learning bias refers to strategies used by the algorithm that give preference to
particular solutions. For example, in Fig. 3.4, we used the criterion of information
gain (reduction of entropy) to select attributes. However, we could also have used
the Gini index of diversity G rather than entropy E to select attributes, thus resulting
in different decision trees.
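The two criteria are easy to compare side by side. The sketch below uses the standard definitions E = -sum_i p_i log2 p_i and G = 1 - sum_i (p_i)^2; the class distributions of the candidate split are made up for illustration:

import numpy as np

def entropy(p):
    # E = -sum_i p_i * log2(p_i), with 0 * log2(0) taken to be 0
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    # G = 1 - sum_i p_i^2
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# illustrative split: a 50/50 parent node and two equally large children
parent = [0.5, 0.5]
children = [([0.8, 0.2], 0.5), ([0.2, 0.8], 0.5)]  # (distribution, weight)

for name, f in (("entropy gain", entropy), ("Gini gain", gini)):
    gain = f(parent) - sum(w * f(p) for p, w in children)
    print(name, round(gain, 3))

Because the two impurity measures weigh class distributions differently, they can rank candidate attributes differently, and the greedy construction of the tree may therefore take a different course.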
Both factors also play a role in process mining. Consider, for example, Fig. 1.5
in the first chapter. This process model was discovered using the α-algorithm [103]
based on the set of traces {⟨a,b,d,e,h⟩, ⟨a,d,c,e,g⟩, ⟨a,c,d,e,f,b,d,e,g⟩,