


          26    Chapter 1 Introduction                       2011/6/1  3:12  Page 26  #26



                           You can see there are many similarities between data mining and machine learning.
                         For classification and clustering tasks, machine learning research often focuses on the
                         accuracy of the model. In addition to accuracy, data mining research places strong
                         emphasis on the efficiency and scalability of mining methods on large data sets, as well
                         as on ways to handle complex types of data and explore new, alternative methods.


                   1.5.3 Database Systems and Data Warehouses
                         Database systems research focuses on the creation, maintenance, and use of databases
                         for organizations and end-users. In particular, database systems researchers have
                         established highly recognized principles in data models, query languages, query processing
                         and optimization methods, data storage, and indexing and accessing methods. Database
                         systems are well known for their high scalability in processing very large, relatively
                         structured data sets.
                           Many data mining tasks need to handle large data sets or even real-time, fast stream-
                         ing data. Therefore, data mining can make good use of scalable database technologies to
                         achieve high efficiency and scalability on large data sets. Moreover, data mining tasks can
                         be used to extend the capability of existing database systems to satisfy advanced users’
                         sophisticated data analysis requirements.
                           Recent database systems have built systematic data analysis capabilities on database
                         data using data warehousing and data mining facilities. A data warehouse integrates
                         data originating from multiple sources and various timeframes. It consolidates data
                         in multidimensional space to form partially materialized data cubes. The data cube
                         model not only facilitates OLAP in multidimensional databases but also promotes
                         multidimensional data mining (see Section 1.3.2).
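
The idea of consolidating data in multidimensional space can be sketched in a few lines of Python. The fact table, the dimension names (city, quarter, item), and the build_cube helper below are hypothetical, chosen only for illustration. For simplicity the sketch materializes every cuboid, whereas a data warehouse would typically materialize only some of them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (city, quarter, item, sales) rows.
SALES = [
    ("Vancouver", "Q1", "TV", 100),
    ("Vancouver", "Q2", "TV", 150),
    ("Toronto", "Q1", "TV", 80),
    ("Toronto", "Q1", "Phone", 120),
]
DIMENSIONS = ("city", "quarter", "item")


def build_cube(rows, dimensions):
    """Sum the sales measure for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for row in rows:
                # Group by the chosen dimensions; the measure is the last field.
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


cube = build_cube(SALES, DIMENSIONS)
print(cube[()])                  # apex cuboid: the grand total
print(cube[("city",)])           # sales rolled up to the city level
print(cube[("city", "item")])    # sales by city and item
```

Each key of cube names one cuboid, so moving from ("city", "item") to ("city",) corresponds to an OLAP roll-up along the item dimension.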


                   1.5.4 Information Retrieval
                         Information retrieval (IR) is the science of searching for documents or information
                         in documents. Documents can be text or multimedia, and may reside on the Web. The
                         differences between traditional information retrieval and database systems are twofold:
                         Information retrieval assumes that (1) the data under search are unstructured; and (2)
                         the queries are formed mainly by keywords, which do not have complex structures
                         (unlike SQL queries in database systems).
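
The two differences can be made concrete with a small Python sketch. The table, the documents, and the keyword_search helper are hypothetical and serve only to contrast a structured SQL query with a flat keyword query; real retrieval systems rank results rather than return exact matches.

```python
import sqlite3

# Structured data: a relational table queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paper (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO paper VALUES (?, ?)",
    [("Mining frequent patterns", 2000), ("Query optimization", 1995)],
)
sql_hits = conn.execute(
    "SELECT title FROM paper WHERE year >= 1998 ORDER BY title"
).fetchall()

# Unstructured data: free text searched with a keyword query.
DOCUMENTS = {
    "d1": "Frequent pattern mining scales to large data sets",
    "d2": "Query optimization in relational database systems",
}


def keyword_search(documents, query):
    """Return the ids of documents that contain every query keyword."""
    keywords = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if all(k in text.lower().split() for k in keywords)
    )


keyword_hits = keyword_search(DOCUMENTS, "database query")
print(sql_hits)
print(keyword_hits)
```

The SQL query relies on the schema (a typed year column), while the keyword query is just a set of words matched against raw text.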
                           The typical approaches in information retrieval adopt probabilistic models. For
                         example, a text document can be regarded as a bag of words, that is, a multiset of words
                         appearing in the document. The document’s language model is the probability
                         distribution that generates the bag of words in the document. The similarity between two
                         documents can be measured by the similarity between their corresponding language
                         models.
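
A minimal sketch of these ideas follows, under stated assumptions: tokens are split on whitespace, the language model is a maximum-likelihood unigram estimate, and similarity is measured with the cosine between the two word distributions. The text does not prescribe a particular estimator or similarity measure, so these choices and all function names are illustrative only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return the multiset of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def language_model(bag):
    """Maximum-likelihood unigram model: P(w) = count(w) / total count."""
    total = sum(bag.values())
    return {word: count / total for word, count in bag.items()}


def cosine_similarity(p, q):
    """Cosine between two word distributions (an assumed measure)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


doc_a = "data mining finds patterns in data"
doc_b = "data mining uses database systems"
doc_c = "the cat sat on the mat"

model_a = language_model(bag_of_words(doc_a))
model_b = language_model(bag_of_words(doc_b))
model_c = language_model(bag_of_words(doc_c))

print(bag_of_words(doc_a))
print(cosine_similarity(model_a, model_b))
print(cosine_similarity(model_a, model_c))
```

Documents that share vocabulary (doc_a and doc_b) score higher than documents with no words in common (doc_a and doc_c), which score zero under this measure.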
                           Furthermore, a topic in a set of text documents can be modeled as a probability dis-
                         tribution over the vocabulary, which is called a topic model. A text document, which
                         may involve one or multiple topics, can be regarded as a mixture of multiple topic mod-
                         els. By integrating information retrieval models and data mining techniques, we can find