⟨a,d,b,e,h⟩, ⟨a,c,d,e,f,d,c,e,f,b,d,e,h⟩, ⟨a,c,d,e,g⟩}. Clearly, there is a
representational bias. The assumption is that the process can be represented by a Petri
net in which every transition bears a unique and visible label. Many processes cannot
be represented by such a Petri net. The α-algorithm also has a learning bias, as it
focuses on “direct succession”. If a is directly followed by b in the event log, then
this information is used. However, an observation such as “a is eventually followed
by b in the event log” is not exploited by the α-algorithm.
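To make this learning bias concrete, the following minimal Python sketch (the list-of-lists encoding of the log and the helper name directly_follows are our own illustrative choices, not part of the α-algorithm's formal definition) extracts exactly the “direct succession” information from the three traces above:

# Sketch: the "directly follows" pairs that the alpha-algorithm's
# learning bias is based on. Traces are plain lists of activity labels.
log = [
    ["a", "d", "b", "e", "h"],
    ["a", "c", "d", "e", "f", "d", "c", "e", "f", "b", "d", "e", "h"],
    ["a", "c", "d", "e", "g"],
]

def directly_follows(traces):
    """Return all pairs (x, y) such that y directly follows x in some trace."""
    pairs = set()
    for trace in traces:
        for x, y in zip(trace, trace[1:]):
            pairs.add((x, y))
    return pairs

print(sorted(directly_follows(log)))
# Note that ("a", "e") is not among the pairs, even though a is eventually
# followed by e in every trace: only direct succession is recorded.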
An inductive bias is not necessarily bad. In fact, it is often needed to come to a
solution. However, the analyst should be aware of this bias and reflect on the implicit
choices made.
Curse of Dimensionality
Some data sets have many variables. However, for most data mining problems
the amount of data needed to maintain a specific level of accuracy is expo-
nential in the number of parameters [52]. High-dimensional problems, i.e.,
analyzing a data set with many variables, may be computationally intractable
or lead to incorrect conclusions. This is the “curse of dimensionality” that
many real-life applications of data mining are confronted with. Consider, for
example, a supermarket selling 1000 products. In this case, there are 2¹⁰⁰⁰ − 1
potential item-sets. Although the Apriori algorithm can quickly rule out many
irrelevant candidates, the generation of association rules in such a setting is
likely to encounter performance problems. Moreover, the interpretation of the
results is typically difficult due to an excessive number of potential rules.
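The arithmetic behind this explosion is easy to check: n products give 2ⁿ − 1 non-empty item-sets. A small Python sketch (purely illustrative; the product names are made up and brute-force enumeration is of course not how the Apriori algorithm itself works):

from itertools import combinations

# The number of non-empty item-sets over n products is 2**n - 1.
# Check by enumeration for a tiny, made-up assortment ...
products = ["bread", "milk", "beer", "diapers"]
itemsets = [c for size in range(1, len(products) + 1)
              for c in combinations(products, size)]
assert len(itemsets) == 2 ** len(products) - 1   # 15 item-sets for 4 products

# ... and look at the magnitude for n = 1000 (enumeration is hopeless here).
n = 1000
print(f"{2.0 ** n:.2e}")   # approximately 1.07e+301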
In a supermarket having hundreds or thousands of products, there are many
customers that purchase a unique combination of products. If there are 1000
different products, then there are 2¹⁰⁰⁰ − 1 ≈ 1.07 × 10³⁰¹ possible shopping
lists (ignoring quantities). Although the probability that two customers
purchase the same combination is small, the number of potential rules is very large. This
problem is not restricted to association rule learning. Clustering or regression
in a 1000-dimensional space will suffer from similar problems. Typical ap-
proaches to address this problem are variable selection and transformation
[52]. The goal of variable selection is simply to remove irrelevant or redun-
dant variables. For example, the student’s registration number and address
are irrelevant when predicting study progress. Sometimes the data set can be
transformed to reduce dimensionality, e.g., taking the average mark rather
than individual marks per course.
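Both remedies can be illustrated with a small, entirely hypothetical student table (the pandas-based encoding and all column names and values are our own, not taken from [52]): variable selection drops columns that carry no predictive value, and transformation collapses the per-course marks into a single average.

import pandas as pd

# Hypothetical student records; all names and values are made up.
df = pd.DataFrame({
    "registration_no": [1001, 1002, 1003],
    "address":         ["X street 1", "Y lane 2", "Z road 3"],
    "mark_course_1":   [7.5, 6.0, 8.0],
    "mark_course_2":   [8.0, 5.5, 7.0],
    "mark_course_3":   [6.5, 6.0, 9.0],
    "progress":        ["on track", "delayed", "on track"],
})

# Variable selection: remove columns irrelevant for predicting progress.
df = df.drop(columns=["registration_no", "address"])

# Transformation: replace the individual marks by their average,
# reducing three dimensions to one.
mark_cols = [c for c in df.columns if c.startswith("mark_")]
df["average_mark"] = df[mark_cols].mean(axis=1)
df = df.drop(columns=mark_cols)

print(df)   # remaining columns: progress, average_mark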
Another problem is the delicate balance between overfitting and underfitting.
A learned model is overfitting if it is too specific and too much driven by acci-
dental information in the data set. For example, when constructing a decision tree
for a training set without conflicting input (i.e., instances with identical attributes
belong to the same class), it is easy to construct a decision tree with a perfect F1
score. This tree can be obtained by continuing to split nodes until each leaf node