Page 424 - Matrix Analysis & Applied Linear Algebra
P. 424

420              Chapter 5                    Norms, Inner Products, and Orthogonality


                                      T
                                    |q A| = | cos θ 1 |, | cos θ 2 |,. . . , | cos θ n |  provides the information that allows
                                    the search engine to rank the relevance of each document relative to the query.
                                    However, due to things like variation and ambiguity in the use of vocabulary,
                                    presentation style, and even the indexing process, there is a lot of “noise” in
                                                         T
                                    A, so the results in |q A| are nowhere near being an exact measure of how
                                    well query q matches the various documents. To filter out some of this noise,
                                                                                             r       T
                                    the techniques of Example 5.12.3 are employed. An SVD A =   σ i u i v  is
                                                                                             i=1     i
                                    judiciously truncated, and
                                                                                       
                                                                                   v T
                                                                      σ 1             1      k
                                                    T                     .         .             T
                                                                                      .
                                       A k = U k D k V = u 1 |· · · | u k    . .     .  =  σ i u i v i
                                                    k
                                                                                     v T    i=1
                                                                              σ k
                                                                                      k
                                    is used in place of A in (5.12.12). In other words, instead of using cos θ j , query
                                    q is compared with document D j by using the magnitude of
                                                                        T
                                                                      q A k e j
                                                            cos φ j =           .
                                                                     q   A k e j
                                                                       2       2


                                                                                       T
                                    To make this more suitable for computation, set S k = D k V = s 1 | s 2 |· · · | s k ,
                                                                                       k
                                    and use

                                                                    T

                                                   A k e j   = U k D k V e j  
  =  U k s j   =  s j
                                                                    k
                                                         2               2         2      2
                                    to write
                                                                        T
                                                                      q U k s j
                                                              cos φ j =        .                 (5.12.13)
                                                                      q   s j
                                                                         2    2
                                    The vectors in U k and S k only need to be computed once (and they can be
                                    determined without computing the entire SVD), so (5.12.13) requires very little
                                    computation to process each new query. Furthermore, we can be generous in the
                                    number of SVD components that are dropped because variation in the use of
                                    vocabulary and the ambiguity of many words produces significant noise in A.
                                    Coupling this with the fact that numerical accuracy is not an important issue
                                    (knowing a cosine to two or three significant digits is sufficient) means that we
                                    are more than happy to replace the SVD of A by alow-rank truncation A k ,
                                    where k is significantly less than r.
                                    Alternate Query Matching Strategy.    An alternate way to measuring how
                                    close a given query q is to a document vector d j is to replace the query vector
                                    q in (5.12.12) by the projected query 9 q = P R(A) q, where P R(A) = U r U T r  is the
                                    orthogonal projector onto R (A) along R (A) ⊥  (Exercise 5.12.15) to produce
                                                                        T
                                                                       9 q Ae j
                                                             cos θ j =          .                (5.12.14)
                                                                9
                                                                     9 q   Ae j
                                                                        2     2
   419   420   421   422   423   424   425   426   427   428   429