Page 189 -
P. 189
6.3 Genetic Process Mining 171
Therefore, a more refined fitness function needs to be used that also rewards the
partial correctness of the model and takes into account all four competing quality
criteria described in Sect. 5.4.3. The best individuals, i.e., the process models having
the highest fitness value are moved to the next generation. This is called elitism.For
instance, the best 1% of the current generation is passed on to the next generation
without any modifications. Through tournaments, “parents” are selected for creating
new individuals. Tournaments among individuals and elitism should make sure that
the “genetic material of the best process models” has the highest probability of being
used for the next generation: survival of the fittest. As a result, individuals with
a poor fitness are unlikely to survive. Figure 6.8 refers to such models as “dead”
individuals.
In the reproduction phase, the selected parent individuals are used to create
new offspring. Here two genetic operators are used: crossover and mutation.For
crossover two individuals are taken and used to create two new models; these end up
in the pool with “child models” shown in Fig. 6.8. These child models share parts of
the genetic material of their parents. The resulting children are then modified using
mutation, e.g., randomly adding or deleting a causal dependency. Mutation is used
to insert new generic material in the next generation. Without mutation, evolution
beyond the genetic material in the initial population is impossible.
Through reproduction (i.e., crossover and mutation) and elitism, a new genera-
tion is created. For the models in this generation the fitness is computed. Again the
best individuals move on to the next round (elitism) or are used to produce new off-
spring. This is repeated and the expectation is that the “quality” of each generation
gets better and better. The evolution process terminates when a satisfactory solution
is found, i.e., a model having at least the desired fitness. Depending on the event log
it may take a very long time to converge. In fact, due to the representational bias
and noise in the event log there may not be a model that has the desired level of fit-
ness. Therefore, other termination criteria may be added (e.g., a maximum number
of generations or stopping when 10 successive generations do not produce better
individuals). When terminating, a model with the best fitness is returned.
The approach described in Fig. 6.8 is very general. When actually implementing
a genetic process mining algorithm the following design choices need to made:
• Representation of individuals. Each individual corresponds to a process model
described in a particular language, e.g., Petri nets, C-nets, BPMN, or EPCs. This
choice is important as it determines the class of processes that can be discovered
(representational bias). Moreover, it should be possible to define suitable genetic
operators for the representation chosen. In [31], a variant of C-nets is used.
• Initialization. For the initial population, models need to be generated randomly.
In [31], two approaches are proposed: (a) an approach where with a certain prob-
ability a causal dependency between two activities is inserted to create C-nets
and (b) an approach in which a randomized variant of heuristic mining is used
to create an initial population with a higher average fitness than purely randomly
generated C-nets.
• Fitness function. Here, the challenge is to define a function that balances the four
competing quality criteria described in Sect. 5.4.3. Many fitness functions can be