                                                                    1.7 Major Issues in Data Mining  31


                                 into the knowledge discovery process. Such knowledge can be used for pattern
                                 evaluation as well as to guide the search toward interesting patterns.
                                 Ad hoc data mining and data mining query languages: Query languages (e.g., SQL)
                                 have played an important role in flexible searching because they allow users to pose
                                 ad hoc queries. Similarly, high-level data mining query languages or other high-level
                                 flexible user interfaces will give users the freedom to define ad hoc data mining tasks.
                                 This should facilitate specification of the relevant sets of data for analysis, the domain
                                 knowledge, the kinds of knowledge to be mined, and the conditions and constraints
                                 to be enforced on the discovered patterns. Optimization of the processing of such
                                 flexible mining requests is another promising area of study.
                                 Presentation and visualization of data mining results: How can a data mining system
                                 present data mining results, vividly and flexibly, so that the discovered knowledge
                                 can be easily understood and directly used by humans? This is especially crucial
                                 if the data mining process is interactive. It requires the system to adopt expressive
                                 knowledge representations, user-friendly interfaces, and visualization techniques.



                         1.7.3 Efficiency and Scalability
                               Efficiency and scalability are always considered when comparing data mining
                               algorithms. As data volumes continue to grow, these two factors become especially critical.

                                 Efficiency and scalability of data mining algorithms: Data mining algorithms must be
                                 efficient and scalable in order to effectively extract information from huge amounts
                                 of data in many data repositories or in dynamic data streams. In other words, the
                                 running time of a data mining algorithm must be predictable, short, and acceptable
                                 to applications. Efficiency, scalability, performance, optimization, and the ability to
                                 execute in real time are key criteria that drive the development of many new data
                                 mining algorithms.
                                 Parallel, distributed, and incremental mining algorithms: The enormous size of many
                                 data sets, the wide distribution of data, and the computational complexity of some
                                 data mining methods are factors that motivate the development of parallel and
                                 distributed data-intensive mining algorithms. Such algorithms first partition the data
                                 into “pieces.” Each piece is then searched for patterns in parallel, and the parallel
                                 processes may interact with one another. The patterns from each partition are
                                 eventually merged.
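
The partition, search, and merge steps described here can be illustrated with a minimal sketch. It is not a method from the text: the function names (`partition`, `count_pairs`, `mine_frequent_pairs`), the choice of co-occurring item pairs as the "pattern," and the absolute `min_support` threshold are all assumptions made for illustration. A thread pool keeps the example self-contained; a real system would more likely use separate processes or a cluster, and the workers here do not interact with one another.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def partition(transactions, n_parts):
    """Split the transaction list into n_parts roughly equal pieces."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def count_pairs(piece):
    """Local pattern search: count co-occurring item pairs in one piece."""
    counts = Counter()
    for items in piece:
        # Sort and deduplicate so each pair has one canonical form.
        counts.update(combinations(sorted(set(items)), 2))
    return counts


def mine_frequent_pairs(transactions, min_support, n_parts=4):
    """Partition the data, search each piece concurrently, merge the results."""
    pieces = partition(transactions, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        local_counts = list(pool.map(count_pairs, pieces))

    # Merge step: sum the local counts from every partition.
    merged = Counter()
    for counts in local_counts:
        merged.update(counts)

    # Keep only pairs that occur in at least min_support transactions.
    return {pair: n for pair, n in merged.items() if n >= min_support}
```

Because pair counts are additive across partitions, the merged result does not depend on how the data was split, which is what makes this particular pattern easy to parallelize.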
                                    Cloud computing and cluster computing, which use computers in a distributed
                                 and collaborative way to tackle very large-scale computational tasks, are also active
                                 research themes in parallel data mining. In addition, the high cost of some data
                                 mining processes and the incremental nature of input promote incremental data mining,
                                 which incorporates new data updates without having to mine the entire data set “from
                                 scratch.” Such methods perform knowledge modification incrementally to amend
                                 and strengthen what was previously discovered.
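
As a rough illustration of incremental mining, the sketch below keeps running support counts and folds each new batch of transactions into them, so earlier data is never rescanned. The class name `IncrementalItemCounter`, its methods, and the use of single-item relative support as the "knowledge" being maintained are hypothetical choices for this example, not something the text prescribes.

```python
from collections import Counter


class IncrementalItemCounter:
    """Keeps item support counts up to date as new transactions arrive."""

    def __init__(self, min_support):
        # min_support is a relative threshold in [0, 1] (an assumption here).
        self.min_support = min_support
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, new_transactions):
        """Fold a batch of new transactions into the existing counts."""
        for items in new_transactions:
            # Count each item at most once per transaction.
            self.counts.update(set(items))
            self.n_transactions += 1

    def frequent_items(self):
        """Return items whose relative support meets the threshold."""
        if self.n_transactions == 0:
            return {}
        return {
            item: n / self.n_transactions
            for item, n in self.counts.items()
            if n / self.n_transactions >= self.min_support
        }
```

Each call to `update` costs time proportional to the new batch only. Note that previously frequent items can drop below the threshold as data accumulates, so the result amends earlier findings as well as strengthening them.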