Page 423 - Matrix Analysis & Applied Linear Algebra
P. 423

5.12 Singular Value Decomposition                                                  419
                   Example 5.12.4

                                    Search Engines. The filtering idea presented in Example 5.12.3 is widely used,
                                    but a particularly novel application is the method of latent semantic indexing
                                    used in the areas of information retrieval and text mining. You can think of
                                    this in terms of building an Internet search engine. Start with a dictionary of
                                    terms T 1 ,T 2 ,...,T m . Terms are usually single words, but sometimes a term
                                    may contain more that one word such as “landing gear.” It’s up to you to decide
                                    how extensive your dictionary should be, but even if you use the entire English
                                    language, you probably won’t be using more than a few hundred-thousand terms,
                                    and this is within the capacity of existing computer technology. Each document
                                    (or web page) D j of interest is scanned for key terms (this is called indexing the
                                    document), and an associated document vector d j =(freq , freq ,..., freq mj ) T
                                                                                           2j
                                                                                      1j
                                    is created in which
                                            freq =number of times term T i occurs in document D j .
                                                ij
                                    (More sophisticated search engines use weighted frequency strategies.) After a
                                    collection of documents D 1 ,D 2 ,...,D n has been indexed, the associated docu-
                                    ment vectors d j are placed as columns in a term-by-document matrix

                                                                         D 1     D 2   ···   D n
                                                                                                
                                                                   T 1  freq 11  freq 12  ···  freq 1n
                                                                   T 2    freq  freq  ···  freq  
                                                                          21      22          2n 
                                         A m×n = d 1 | d 2 ··· | d n =  .    .   .           .  .
                                                                    .     .       .           .
                                                                    .    .       .           .  
                                                                   T m  freq m1  freq m2  ··· freq mn
                                    Naturally, most entries in each document vector d j will be zero, so A is a
                                    sparse matrix—this is good because it means that sparse matrix technology can
                                    be applied. When a query composed of a few terms is submitted to the search
                                                         T
                                    engine, a query vector q =(q 1 ,q 2 ,...,q n )is formed in which
                                                        5  1if term T i appears in the query,
                                                    q i =
                                                          0  otherwise.
                                    (The q i ’s might also be weighted.) To measure how well a query q matches a
                                    document D j , we check how close q is to d j by computing the magnitude of

                                                                  T           T
                                                                 q d j       q Ae j
                                                      cos θ j =         =             .          (5.12.12)
                                                               q   d j      q   Ae j
                                                                 2     2      2      2
                                    If | cos θ j |≥ τ for some threshold tolerance τ, then document D j is con-
                                    sidered relevant and is returned to the user. Selecting τ is part art and part
                                    science that’s based on experimentation and desired performance criteria. If the
                                    columns of A along with q are initially normalized to have unit length, then
   418   419   420   421   422   423   424   425   426   427   428