

          Chapter 1 Introduction



                   1.6.2 Web Search Engines
                         A Web search engine is a specialized computer server that searches for information
                         on the Web. The search results of a user query are often returned as a list (sometimes
                         called hits). The hits may consist of web pages, images, and other types of files. Some
                         search engines also search and return data available in public databases or open directories.
                         Search engines differ from web directories in that web directories are maintained
                         by human editors whereas search engines operate algorithmically or by a mixture of
                         algorithmic and human input.
                           Web search engines are essentially very large data mining applications. Various data
                         mining techniques are used in all aspects of search engines, ranging from crawling⁵
                         (e.g., deciding which pages should be crawled and how often) to indexing
                         (e.g., selecting pages to be indexed and deciding to what extent the index should be
                         constructed) and searching (e.g., deciding how pages should be ranked, which
                         advertisements should be added, and how the search results can be personalized or made
                         “context aware”).
                           Search engines pose grand challenges to data mining. First, they have to handle a
                         huge and ever-growing amount of data. Typically, such data cannot be processed using
                         one or a few machines. Instead, search engines often need to use computer clouds, which
                         consist of thousands or even hundreds of thousands of computers that collaboratively
                         mine the huge amount of data. Scaling up data mining methods over computer clouds
                         and large distributed data sets is an area for further research.
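                           The idea of many machines collaboratively mining one data set can be sketched
                         as a partition-and-merge computation. The example below is only an illustration
                         under simple assumptions: the shards, the documents in them, and the function
                         names count_shard and merge_counts are hypothetical, and everything runs in one
                         process rather than on a real cluster.

```python
from collections import Counter
from functools import reduce

def count_shard(shard):
    """Map step: one (hypothetical) machine counts terms in its own shard."""
    counts = Counter()
    for doc in shard:
        counts.update(doc.lower().split())
    return counts

def merge_counts(partials):
    """Reduce step: combine the per-shard counts into global counts."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Toy data standing in for documents spread over two machines.
SHARDS = [
    ["web search engines", "data mining"],
    ["search engines mine data"],
]

GLOBAL = merge_counts(count_shard(s) for s in SHARDS)
print(GLOBAL.most_common(3))
```

                           Because each shard is counted independently and the merge is a simple sum, the
                         map step could in principle run on separate machines; scheduling, fault tolerance,
                         and data movement are left out of this sketch.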
                           Second, Web search engines often have to deal with online data. A search engine
                         may be able to afford constructing a model offline on huge data sets. For example, it may
                         construct a query classifier that assigns a search query to predefined categories based on
                         the query topic (e.g., whether the search query “apple” is meant to retrieve information
                         about a fruit or a brand of computers). Even if a model is constructed offline, the
                         application of the model online must be fast enough to answer user queries in real time.
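                           The split between offline construction and fast online application can be
                         illustrated with a very small query classifier. This is a minimal sketch, assuming
                         a multinomial naive Bayes model over query words; the text does not say which
                         model a search engine would use. The training queries, the category names, and
                         the functions train and classify are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled queries (illustrative only, not real search logs).
TRAINING = [
    ("apple pie recipe", "food"),
    ("apple nutrition facts", "food"),
    ("apple laptop price", "computers"),
    ("apple phone battery", "computers"),
]

def train(examples):
    """Offline step: collect per-category word counts."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for query, cat in examples:
        cat_counts[cat] += 1
        for w in query.split():
            word_counts[cat][w] += 1
            vocab.add(w)
    return word_counts, cat_counts, vocab

def classify(query, model):
    """Online step: score each category using only dictionary lookups."""
    word_counts, cat_counts, vocab = model
    total = sum(cat_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in query.split():
            # Laplace smoothing so unseen words do not zero out a category.
            score += math.log((word_counts[cat][w] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

MODEL = train(TRAINING)           # done once, offline
print(classify("apple pie", MODEL))     # done per query, online
print(classify("apple laptop", MODEL))
```

                           The query "apple" on its own is ambiguous for this toy model; it is the
                         surrounding words that separate the fruit from the computer brand.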
                           Another challenge is maintaining and incrementally updating a model on fast-growing
                         data streams. For example, a query classifier may need to be maintained
                         incrementally because new queries keep emerging and the predefined categories
                         and the data distribution may change. Most existing model training methods are
                         offline and static and thus cannot be used in such a scenario.
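                           One way to picture incremental maintenance is a classifier whose counts are
                         updated one labeled query at a time, so that new words and new categories can
                         appear without retraining from scratch. The sketch below is an assumption-laden
                         illustration, not a method described in the text: the class
                         StreamingQueryClassifier, its decay parameter (a simple forgetting factor for
                         drifting data), and the example queries are hypothetical.

```python
import math
from collections import Counter, defaultdict

class StreamingQueryClassifier:
    """Naive Bayes counts that are updated one labeled query at a time."""

    def __init__(self, decay=1.0):
        self.decay = decay  # hypothetical forgetting factor; 1.0 never forgets
        self.word_counts = defaultdict(Counter)
        self.cat_counts = Counter()
        self.vocab = set()

    def update(self, query, category):
        """Fold one new example into the model in place."""
        if self.decay < 1.0:
            # Down-weight old evidence so the model can follow drift.
            for cat in self.cat_counts:
                self.cat_counts[cat] *= self.decay
                for w in self.word_counts[cat]:
                    self.word_counts[cat][w] *= self.decay
        self.cat_counts[category] += 1
        for w in query.split():
            self.word_counts[category][w] += 1
            self.vocab.add(w)

    def predict(self, query):
        """Return the best-scoring category, or None before any update."""
        if not self.cat_counts:
            return None
        total = sum(self.cat_counts.values())

        def score(cat):
            denom = sum(self.word_counts[cat].values()) + len(self.vocab)
            s = math.log(self.cat_counts[cat] / total)
            for w in query.split():
                s += math.log((self.word_counts[cat][w] + 1) / denom)
            return s

        return max(self.cat_counts, key=score)

clf = StreamingQueryClassifier()
clf.update("apple pie recipe", "food")
clf.update("apple laptop price", "computers")
clf.update("apple tv show", "entertainment")  # a category that emerges later
print(clf.predict("tv show"))
```

                           Each update touches only the counts for the words in one query, which is what
                         makes this style of model practical on a stream; an offline, static trainer would
                         have to reprocess the whole history instead.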
                           Third, Web search engines often have to deal with queries that are asked only a very
                         small number of times. Suppose a search engine wants to provide context-aware query
                         recommendations. That is, when a user poses a query, the search engine tries to infer
                         the context of the query using the user’s profile and query history in order to return
                         more customized answers within a small fraction of a second. However, although the
                         total number of queries asked can be huge, most of the queries may be asked only once
                         or a few times. Such severely skewed data are challenging for many data mining and
                         machine learning methods.
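                           The skew described above can be measured directly from a query log. The
                         following sketch counts how many distinct queries are asked at most a given
                         number of times and how much of the total volume they account for. The function
                         rare_query_share and the toy log are hypothetical; real logs are far larger and
                         far more skewed.

```python
from collections import Counter

def rare_query_share(query_log, max_count=1):
    """Return (share of distinct queries, share of total volume)
    for queries that occur at most max_count times."""
    freq = Counter(query_log)
    rare = [q for q, n in freq.items() if n <= max_count]
    distinct_share = len(rare) / len(freq)
    volume_share = sum(freq[q] for q in rare) / len(query_log)
    return distinct_share, volume_share

# Toy log: one popular query and several one-off queries.
LOG = ["weather"] * 6 + [
    "apple pie recipe",
    "fix bike chain",
    "train times to york",
    "apple laptop price",
]

distinct_share, volume_share = rare_query_share(LOG)
print(distinct_share, volume_share)
```

                           In this toy log most distinct queries are seen only once, so a method that
                         needs many examples per query has almost nothing to learn from for the majority
                         of them.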


                         ⁵ A Web crawler is a computer program that browses the Web in a methodical, automated manner.