
90                                                     3  Data Mining

            corresponds to instances belonging to the same class. However, it is obvious that
            such a decision tree is too specific and has little predictive value.
  A learned model is underfitting if it is too general and allows for things not "supported by evidence" in the data set. Whereas overfitting can be characterized by a lack of generalization, underfitting has the opposite problem: too much generalization. Consider, for example, the generation of association rules. Generating many detailed rules due to very low settings of minsup and minconf corresponds to overfitting: many rules are found, but they are probably rather specific to the training set. Generating very few rules due to very high settings of minsup and minconf corresponds to underfitting. In the extreme case, no association rules are found at all. Note that the model with no rules fits any data set and, hence, carries no information.
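The effect of these thresholds can be sketched with a toy rule generator. The transactions and threshold values below are illustrative assumptions, not taken from the chapter's examples:

```python
from itertools import combinations

# Hypothetical transaction data set (items chosen for illustration only)
transactions = [
    {"latte", "bagel"},
    {"latte", "bagel", "muffin"},
    {"tea", "bagel"},
    {"latte", "muffin"},
    {"tea", "muffin"},
]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(minsup, minconf):
    """Enumerate association rules X -> Y meeting both thresholds."""
    items = set().union(*transactions)
    found = []
    for k in range(2, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            s = support(set(itemset))
            if s < minsup:
                continue  # infrequent itemsets yield no rules
            for i in range(1, len(itemset)):
                for lhs in combinations(itemset, i):
                    confidence = s / support(set(lhs))
                    if confidence >= minconf:
                        found.append((set(lhs), set(itemset) - set(lhs)))
    return found

# Low thresholds yield many specific rules (overfitting);
# high thresholds yield few or no rules (underfitting).
print(len(rules(minsup=0.2, minconf=0.5)))
print(len(rules(minsup=0.8, minconf=0.9)))
```

With the high settings no itemset of size two or more is frequent, so no rules at all are produced, mirroring the extreme case described above.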
  Underfitting is particularly problematic if the data set contains no negative examples. Consider, for example, the confusion matrix in Fig. 3.14(a). Suppose that we have a training set with only positive examples, i.e., n = 0 in the training set. How can one construct a decision tree without negative examples? Most algorithms will simply classify everything as positive. This shows that classification assumes both positive and negative examples. This is not the case for association rule learning. Consider, for example, the data set shown in Table 3.3. Suppose that the itemset {latte, tea, bagel} does not appear in the data set. This implies that no customer ordered these three items together in the training set. Can we conclude from this that it is not possible to order these three items together? Of course not! Therefore, association rule learning focuses on positive examples that are somehow frequent. Nevertheless, for some applications it would be useful to be able to discover "negative rules", such as a rule stating that customers are not allowed to order lattes, teas, and bagels in a single order.
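The positive-examples-only situation can be illustrated with a minimal sketch. The labels and the majority-class leaf rule below are illustrative assumptions, not a specific algorithm from this chapter:

```python
from collections import Counter

def leaf_prediction(labels):
    """Decision tree leaves typically predict the majority class of the
    training instances that reach them."""
    return Counter(labels).most_common(1)[0][0]

# Training set with only positive examples, i.e., n = 0 negatives
training_labels = ["+"] * 6

# No matter how the instances are split, every leaf receives an
# all-positive subset, so the resulting tree classifies every new
# instance as positive.
print(leaf_prediction(training_labels))
```

Whatever attribute tests the algorithm chooses, each leaf inherits a subset of an all-positive training set, so the learned tree degenerates into the constant classifier "positive".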
  A good balance between overfitting and underfitting is of the utmost importance for process discovery. Consider again the Petri net model shown in Fig. 1.5. The model allows for the behavior seen in the event log. It also generalizes as it allows for more sequences than present in the training set. In the event log there is no trace ⟨a, h⟩, i.e., the scenario in which a request is registered and immediately rejected does not appear in the log. This does not necessarily imply that this is not possible. However, constructing a model that allows for ⟨a, h⟩ although it is not in the log would result in a model that is clearly underfitting. This dilemma is caused by the lack of negative examples in the event log. The traces in the event log show what has happened and not what could not happen. This problem will be addressed in later chapters.
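The lack of negative examples amounts to a simple membership observation: absence from the log is not evidence of impossibility. A minimal sketch, with a hypothetical event log whose traces are only loosely inspired by the running example (a = register request, h = reject request, as in the text):

```python
# Hypothetical event log: a list of traces over single-letter activities
event_log = [
    ("a", "b", "d", "e", "h"),
    ("a", "c", "d", "e", "g"),
    ("a", "b", "d", "e", "h"),
]

def observed(trace, log):
    """True iff the trace literally occurs in the event log."""
    return trace in log

# The trace <a, h> was never observed ...
print(observed(("a", "h"), event_log))
# ... but the log only records what has happened; it contains no negative
# examples, so absence does not prove that <a, h> is impossible.
```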
              We conclude this chapter with Occam’s Razor, a principle attributed to the 14th-
            century English logician William of Ockham. The principle states that “one should
            not increase, beyond what is necessary, the number of entities required to explain
            anything”, i.e., one should look for the “simplest model” that can explain what is ob-
            served in the data set. This principle is related to finding a natural balance between
overfitting and underfitting. The Minimum Description Length (MDL) principle tries to operationalize Occam's Razor [47, 129]. According to the MDL paradigm, model quality is no longer based only on predictive performance (e.g., the F1 score), but also on the simplicity of the model. Moreover, it does not aim at cross-validation in the