
  There are two advantages associated with k-fold cross-validation. First of all, all
data is used both as training data and as test data. Second, if desired, one obtains k tests
of the desired performance indicator rather than just one. Formally, these tests cannot
be considered independent, as the training sets used in the k folds overlap
considerably. Nevertheless, the k folds make it possible to get more insight into the
reliability of the result.
  An extreme variant of k-fold cross-validation is “leave-one-out” cross-validation,
also known as jack-knifing. Here k = N, i.e., each test set contains only one element.
See [5, 67] for more information on the various forms of cross-validation.




            3.6.3 Occam’s Razor

            Evaluating the quality of data mining results is far from trivial. In this subsection,
            we discuss some additional complications that are also relevant for process mining.
  Learning is typically an “ill-posed problem”, i.e., only examples are given. Some
examples may rule out certain solutions; however, typically many possible models
remain. Moreover, there is typically a bias in both the target representation and the
learning algorithm. Consider, for example, the sequence 2,3,5,7,11,.... What is
the next element in this sequence? Most readers will guess that it is 13, i.e., the next
prime number, but there are infinitely many sequences that start with 2,3,5,7,11.
Yet, there seems to be a preference for hypothesizing about some solutions. The term
inductive bias refers to a preference for one solution rather than another which can-
not be determined by the data itself but which is driven by external factors.
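  To see why the data alone cannot determine the continuation, the following sketch
(not from the book) uses Lagrange interpolation to construct, for any chosen sixth
value, a polynomial whose first five values are still 2, 3, 5, 7, 11.

    from fractions import Fraction

    def lagrange_value(points, x):
        """Evaluate the Lagrange interpolation polynomial through `points` at x."""
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total

    known = [(1, 2), (2, 3), (3, 5), (4, 7), (5, 11)]
    # For any chosen continuation, a degree-5 polynomial reproduces 2,3,5,7,11 exactly:
    for next_value in (13, 42, 100):
        points = known + [(6, next_value)]
        print([int(lagrange_value(points, x)) for x in range(1, 7)])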
  A representational bias refers to choices that are implicitly made by selecting a
particular representation. For example, in Sect. 3.2, we assumed that in a decision
tree the same attribute may appear only once on a path. This representational bias
rules out certain solutions, e.g., a decision tree in which a numerical attribute is used
in a coarse-grained manner close to the root and in a fine-grained manner in some of
the subtrees. Linear regression also makes assumptions about the function used
to best fit the data. The function is assumed to be linear although there may be
non-linear functions that fit the data much better. Note that a representational bias
is not necessarily bad, e.g., linear regression has been used successfully in many
application domains. However, it is important to realize that the search space is
limited by the chosen representation. These limitations can guide the search process,
but may also exclude good solutions.
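  A small illustration of representational bias (our example, not from the book):
fitting a straight line by ordinary least squares to data generated by a quadratic
function. However the two parameters are chosen, the linear representation cannot
capture the curvature, so a large residual error remains.

    # Fit y = a*x + b by ordinary least squares to data from y = x**2.
    xs = [float(x) for x in range(-5, 6)]
    ys = [x * x for x in xs]                     # truly quadratic relation

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x

    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    # Large residual sum of squares: the bias of the linear form, not noise.
    print(a, b, sum(r * r for r in residuals))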
  A learning bias refers to strategies used by the algorithm that give preference to
particular solutions. For example, in Fig. 3.4, we used the criterion of information
gain (reduction of entropy) to select attributes. However, we could also have used
the Gini index of diversity G rather than entropy E, thus resulting in different
decision trees.
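  For reference, a minimal sketch of the two impurity measures mentioned here (our
code; class labels are assumed to be given as plain Python lists). An attribute that
scores best under entropy E need not score best under the Gini index G, which is why
the two criteria can lead to different trees.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """E = -sum(p_i * log2(p_i)) over the class proportions p_i."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """G = 1 - sum(p_i ** 2) over the class proportions p_i."""
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def split_impurity(branches, measure):
        """Weighted average impurity of a candidate split (list of label lists)."""
        total = sum(len(b) for b in branches)
        return sum(len(b) / total * measure(b) for b in branches)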
              Both factors also play a role in process mining. Consider, for example, Fig. 1.5
            in the first chapter. This process model was discovered using the α-algorithm [103]
based on the set of traces {⟨a,b,d,e,h⟩, ⟨a,d,c,e,g⟩, ⟨a,c,d,e,f,b,d,e,g⟩,