
3.6 Quality of Resulting Models

            ⟨a,d,b,e,h⟩, ⟨a,c,d,e,f,d,c,e,f,b,d,e,h⟩, ⟨a,c,d,e,g⟩}. Clearly, there is a
            representational bias. The assumption is that the process can be represented by a
            Petri net where every transition bears a unique and visible label. Many processes
            cannot be represented by such a Petri net. The α-algorithm also has a learning bias
            as it focuses on “direct succession”. If a is directly followed by b in the event log,
            then this information is used. However, an observation such as “a is eventually
            followed by b in the event log” is not exploited by the α-algorithm.
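               To make the distinction concrete, the following Python sketch (purely illustrative,
            not part of the α-algorithm itself; it assumes the event log is given as a list of
            traces) extracts the directly-follows pairs from the log quoted above and contrasts
            them with an eventually-follows relation:

                # Illustrative sketch: the directly-follows relation that the
                # α-algorithm builds on, versus an eventually-follows relation
                # that it does not use. The log is the one quoted in the text.

                def directly_follows(log):
                    """Pairs (a, b) where a is directly followed by b in some trace."""
                    pairs = set()
                    for trace in log:
                        for a, b in zip(trace, trace[1:]):
                            pairs.add((a, b))
                    return pairs

                def eventually_follows(log):
                    """Pairs (a, b) where a occurs somewhere before b in some trace."""
                    pairs = set()
                    for trace in log:
                        for i, a in enumerate(trace):
                            for b in trace[i + 1:]:
                                pairs.add((a, b))
                    return pairs

                log = [
                    ["a", "d", "b", "e", "h"],
                    ["a", "c", "d", "e", "f", "d", "c", "e", "f", "b", "d", "e", "h"],
                    ["a", "c", "d", "e", "g"],
                ]

                print(sorted(directly_follows(log)))
                # Pairs that hold only "eventually", never directly:
                print(sorted(eventually_follows(log) - directly_follows(log)))

            Only the first relation is exploited by the α-algorithm; a pair such as (a, e), which
            holds eventually but never directly in this log, is ignored.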
               An inductive bias is not necessarily bad. In fact, it is often needed to arrive at a
             solution. However, the analyst should be aware of this bias and reflect on the
             implicit choices made.


              Curse of Dimensionality
              Some data sets have many variables. However, for most data mining problems
              the amount of data needed to maintain a specific level of accuracy is expo-
              nential in the number of parameters [52]. High-dimensional problems, i.e.,
              analyzing a data set with many variables, may be computationally intractable
              or lead to incorrect conclusions. This is the “curse of dimensionality” that
              many real-life applications of data mining are confronted with. Consider, for
              example, a supermarket selling 1000 products. In this case, there are 2^1000 − 1
              potential item-sets. Although the Apriori algorithm can quickly rule out many
              irrelevant candidates, the generation of association rules in such a setting is
              likely to encounter performance problems. Moreover, the interpretation of the
              results is typically difficult due to an excessive number of potential rules.
              In a supermarket having hundreds or thousands of products, there are many
              customers that purchase a unique combination of products. If there are 1000
              different products, then there are 2^1000 − 1 ≈ 1.07 × 10^301 possible shop-
              ping lists (ignoring quantities). Although the probability that two customers
              purchase exactly the same combination of products is small, the number of
              potential rules is very large. This
              problem is not restricted to association rule learning. Clustering or regression
              in a 1000 dimensional space will suffer from similar problems. Typical ap-
              proaches to address this problem are variable selection and transformation
              [52]. The goal of variable selection is to simply remove irrelevant or redun-
              dant variables. For example, the student’s registration number and address
              are irrelevant when predicting study progress. Sometimes the data set can be
              transformed to reduce dimensionality, e.g., taking the average mark rather
              than individual marks per course.
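
              As a rough, purely illustrative Python sketch of the figures and remedies mentioned
            above (the student record and attribute names are made up for the example), one can
            check the size of the item-set space and then apply variable selection followed by a
            simple transformation:

                # Number of potential (non-empty) item-sets for 1000 products.
                n_products = 1000
                n_itemsets = 2 ** n_products - 1
                print(f"{n_itemsets:.2e}")  # about 1.07e+301, as quoted above

                # Variable selection: drop attributes known to be irrelevant for
                # the task, e.g., registration number and address when predicting
                # study progress (attribute names here are hypothetical).
                student = {
                    "registration_number": "0123456",
                    "address": "Main Street 1",
                    "mark_course_1": 7.0,
                    "mark_course_2": 8.5,
                    "mark_course_3": 6.0,
                }
                irrelevant = {"registration_number", "address"}
                selected = {k: v for k, v in student.items() if k not in irrelevant}

                # Transformation: replace the individual marks by one derived
                # variable (the average mark), reducing dimensionality further.
                marks = [v for k, v in selected.items() if k.startswith("mark_")]
                reduced = {"average_mark": sum(marks) / len(marks)}
                print(selected, reduced)

            Both steps shrink the number of variables the learning algorithm has to cope with,
            at the price of discarding or aggregating information.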



              Another problem is the delicate balance between overfitting and underfitting.
             A learned model is overfitting if it is too specific and driven too much by acci-
            dental information in the data set. For example, when constructing a decision tree
            for a training set without conflicting input (i.e., instances with identical attributes
            belong to the same class), it is easy to construct a decision tree with a perfect F1
            score. This tree can be obtained by continuing to split nodes until each leaf node