1.7 Major Issues in Data Mining

into the knowledge discovery process. Such knowledge can be used for pattern
evaluation as well as to guide the search toward interesting patterns.
Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
have played an important role in flexible searching because they allow users to pose
ad hoc queries. Similarly, high-level data mining query languages or other high-level
flexible user interfaces will give users the freedom to define ad hoc data mining tasks.
This should facilitate specification of the relevant sets of data for analysis, the domain
knowledge, the kinds of knowledge to be mined, and the conditions and constraints
to be enforced on the discovered patterns. Optimization of the processing of such
flexible mining requests is another promising area of study.
Presentation and visualization of data mining results: How can a data mining system
present data mining results, vividly and flexibly, so that the discovered knowledge
can be easily understood and directly used by humans? This is especially crucial
if the data mining process is interactive. It requires the system to adopt expressive
knowledge representations, user-friendly interfaces, and visualization techniques.
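
As a rough illustration of the ad hoc mining item above, the four things such a request
specifies (relevant data, domain knowledge, kind of knowledge, and constraints on the
patterns) can be written down as a small data structure. This is only a sketch in Python,
not a real data mining query language: `MiningTask`, `run_task`, and every field name are
hypothetical, and the only supported kind of knowledge is a simple frequent-item count.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class MiningTask:
    """Hypothetical container for one ad hoc data mining request."""
    relevant_data: Callable[[Record], bool]  # which records to analyze
    knowledge_kind: str                      # kind of knowledge to be mined
    # Domain knowledge, here a mapping from an item to a more general category.
    domain_knowledge: Dict[str, str] = field(default_factory=dict)
    min_support: int = 1                     # constraint on discovered patterns


def run_task(task: MiningTask, records: List[Record]) -> Dict[str, int]:
    """Select relevant records, generalize items, count them, apply the constraint."""
    if task.knowledge_kind != "frequent_items":
        raise ValueError("this sketch only supports 'frequent_items'")
    counts: Dict[str, int] = {}
    for rec in records:
        if not task.relevant_data(rec):
            continue  # record is outside the relevant data set
        item = str(rec["item"])
        item = task.domain_knowledge.get(item, item)  # generalize if a mapping exists
        counts[item] = counts.get(item, 0) + 1
    # Keep only the patterns that satisfy the support constraint.
    return {k: v for k, v in counts.items() if v >= task.min_support}
```

A user would change the request by building a different `MiningTask` rather than by
changing the mining code, which is the flexibility the item above describes.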

1.7.3 Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining algorithms.
As data volumes continue to multiply, these two factors are especially critical.

Efficiency and scalability of data mining algorithms: Data mining algorithms must be
efficient and scalable in order to effectively extract information from huge amounts
of data in many data repositories or in dynamic data streams. In other words, the
running time of a data mining algorithm must be predictable, short, and acceptable
to applications. Efficiency, scalability, performance, optimization, and the ability to
execute in real time are key criteria that drive the development of many new data
mining algorithms.
Parallel, distributed, and incremental mining algorithms: The enormous size of many
data sets, the wide distribution of data, and the computational complexity of some
data mining methods are factors that motivate the development of parallel and
distributed data-intensive mining algorithms. Such algorithms first partition the data
into “pieces.” Each piece is processed, in parallel, by searching for patterns. The
parallel processes may interact with one another. The patterns from each partition are
eventually merged.
Cloud computing and cluster computing, which use computers in a distributed
and collaborative way to tackle very large-scale computational tasks, are also active
research themes in parallel data mining. In addition, the high cost of some data
mining processes and the incremental nature of input promote incremental data mining,
which incorporates new data updates without having to mine the entire data set “from
scratch.” Such methods perform knowledge modification incrementally to amend
and strengthen what was previously discovered.
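
The partition, process, and merge steps, and the incremental update, can be sketched
in a few lines of Python. This is a minimal illustration under stated assumptions, not a
production method: the "patterns" are simply co-occurring item pairs, all function names
(`count_pairs`, `partition`, `mine_parallel`, `update_incrementally`, `frequent`) are
hypothetical, the workers do not interact with one another, and a thread pool stands in
for the cluster or cloud workers a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Transaction = Set[str]
Pair = Tuple[str, str]


def count_pairs(transactions: Iterable[Transaction]) -> Counter:
    """Count co-occurring item pairs in one piece of the data."""
    counts: Counter = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


def partition(data: Sequence[Transaction], n_pieces: int) -> List[List[Transaction]]:
    """Split the data into n_pieces roughly equal pieces."""
    return [list(data[i::n_pieces]) for i in range(n_pieces)]


def mine_parallel(data: Sequence[Transaction], n_pieces: int = 4) -> Counter:
    """Process each piece concurrently, then merge the partial results."""
    pieces = partition(data, n_pieces)
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partial_counts = list(pool.map(count_pairs, pieces))
    merged: Counter = Counter()
    for partial in partial_counts:
        merged.update(partial)  # pair counts are additive across pieces
    return merged


def update_incrementally(existing: Counter, new_data: Iterable[Transaction]) -> Counter:
    """Fold new transactions into earlier results without re-mining the old data."""
    updated = existing.copy()
    updated.update(count_pairs(new_data))
    return updated


def frequent(counts: Counter, min_support: int) -> Dict[Pair, int]:
    """Keep only the pairs that occur at least min_support times."""
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

Merging is trivial here because pair counts simply add up across pieces. For richer
pattern types the merge step, and any interaction between the parallel processes, is
usually the hard part of the design.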
