4.1 Data Sources
emails, PDF documents, scanned text, screen scraping, etc. Even if data is structured
and described by metadata, the sheer complexity of enterprise information systems
may be overwhelming. There is no point in trying to exhaustively extract event logs
from thousands of tables and other data sources. Data extraction should be driven
by questions rather than by the mere availability of large amounts of data.
In the context of BI and data mining, the phrase “Extract, Transform, and Load”
(ETL) is used to describe the process that involves: (a) extracting data from outside
sources, (b) transforming it to fit operational needs (dealing with syntactic and
semantic issues while ensuring predefined quality levels), and (c) loading it into
the target system, e.g., a data warehouse or relational database. A data warehouse
is a single logical repository of an organization’s transactional and operational data.
The data warehouse does not produce data but simply taps off data from operational
systems. The goal is to unify information such that it can be used for reporting,
analysis, forecasting, etc. Figure 4.1 shows that ETL activities can be used to popu-
late a data warehouse. Creating the common view required for a data warehouse
may take considerable effort. Different data sources may use different keys, for-
matting conventions, etc. For example, one data source may identify a patient by her
last name and birth date while another data source uses her social security number.
One data source may use the date format “31-12-2010” whereas another uses the
format “2010/12/31”.
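As a sketch of such a transform step, consider normalizing dates during extraction. The function and format list below are illustrative assumptions, not part of any particular ETL tool:

```python
from datetime import datetime

# Two hypothetical source systems use different date formats,
# e.g. "31-12-2010" versus "2010/12/31".
KNOWN_DATE_FORMATS = ["%d-%m-%Y", "%Y/%m/%d"]

def normalize_date(raw: str) -> str:
    """Try each known source format and return an ISO 8601 date string."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("31-12-2010"))  # -> 2010-12-31
print(normalize_date("2010/12/31"))  # -> 2010-12-31
```

Resolving different patient keys (name plus birth date versus social security number) is harder: it typically requires an explicit mapping table rather than a simple syntactic conversion.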
If a data warehouse already exists, it most likely holds valuable input for pro-
cess mining. However, many organizations do not have a good data warehouse. The
warehouse may contain only a subset of the information needed for end-to-end pro-
cess mining, e.g., only data related to customers is stored. Moreover, if a data ware-
house is present, it does not need to be process oriented. For example, the typical
warehouse data used for Online Analytical Processing (OLAP) does not provide
much process-related information. OLAP tools are excellent for viewing multidi-
mensional data from different angles, drilling down, and for creating all kinds of
reports. However, OLAP tools do not require the storage of business events and
their ordering. The data sets used by the mainstream data mining approaches de-
scribed in Chap. 3 also do not store such information. For example, a decision tree
learner can be applied to any table consisting of rows (instances) and columns (vari-
ables). As will be shown in the next section, process mining requires information on
relevant events and their order.
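The contrast can be made concrete with a small sketch (the column names, activities, and timestamps are invented for illustration). A data mining table has one row per instance and no notion of ordering, whereas an event log records, per case, a sequence of timestamped events:

```python
# A flat table as used by a decision tree learner: rows are instances,
# columns are variables; there is no notion of event ordering.
table = [
    {"age": 45, "insurance": "basic", "readmitted": False},
    {"age": 62, "insurance": "extended", "readmitted": True},
]

# An event log as needed for process mining: each case is an ordered
# sequence of events, each with at least an activity and a timestamp.
event_log = {
    "case-1": [
        ("register patient", "2010-12-30T09:00:00"),
        ("examine patient", "2010-12-30T10:30:00"),
        ("discharge patient", "2010-12-31T16:00:00"),
    ],
}

# Ordering matters: sorting events by timestamp recovers the trace
# (the control flow) of each case.
for case_id, events in event_log.items():
    trace = [activity for activity, _ in sorted(events, key=lambda e: e[1])]
    print(case_id, "->", trace)
```

No reordering of the rows of `table` changes what a decision tree learner can discover; reordering the events of a case changes the process behavior it describes.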
Whether there is a data warehouse or not, data needs to be extracted and con-
verted into event logs. Here, scoping is of the utmost importance. Often the problem
is not the syntactical conversion but the selection of suitable data. Questions like
“Which of the more than 10,000 SAP tables to convert?” need to be answered first.
Typical formats to store event logs are XES (eXtensible Event Stream) and MXML
(Mining eXtensible Markup Language). These will be discussed in Sect. 4.3. For the
moment, we assume that one event log corresponds to one process, i.e., when scop-
ing the data in the extraction step, only events relevant for the process to be analyzed
should be included. In Sect. 4.4, we discuss the problem of converting “3-D data”
into “2-D event logs”, i.e., events are projected onto the desired process model.
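To make the one-log-one-process assumption concrete, the following sketch builds a minimal event log in the spirit of XES using only Python's standard library. The case and activity names are invented, and the sketch covers only the basic trace/event nesting; Sect. 4.3 defines the actual format:

```python
import xml.etree.ElementTree as ET

def build_log(cases):
    """Build a minimal XES-style log: one <trace> per case and one <event>
    per (activity, timestamp) pair, with events ordered by timestamp."""
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in cases.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string",
                      {"key": "concept:name", "value": case_id})
        for activity, timestamp in sorted(events, key=lambda e: e[1]):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string",
                          {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date",
                          {"key": "time:timestamp", "value": timestamp})
    return log

log = build_log({"case-1": [("register patient", "2010-12-30T09:00:00"),
                            ("examine patient", "2010-12-30T10:30:00")]})
print(ET.tostring(log, encoding="unicode"))
```

Note that only events scoped to the process under analysis end up in the log; events from the same systems that belong to other processes are simply not passed to `build_log`.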
Depending on the questions and viewpoint chosen, different event logs may be
extracted from the same data set. Consider for example the data in a hospital. One