Page 423 - Matrix Analysis & Applied Linear Algebra
P. 423
5.12 Singular Value Decomposition 419
Example 5.12.4
Search Engines. The filtering idea presented in Example 5.12.3 is widely used,
but a particularly novel application is the method of latent semantic indexing
used in the areas of information retrieval and text mining. You can think of
this in terms of building an Internet search engine. Start with a dictionary of
terms T 1 ,T 2 ,...,T m . Terms are usually single words, but sometimes a term
may contain more that one word such as “landing gear.” It’s up to you to decide
how extensive your dictionary should be, but even if you use the entire English
language, you probably won’t be using more than a few hundred-thousand terms,
and this is within the capacity of existing computer technology. Each document
(or web page) D j of interest is scanned for key terms (this is called indexing the
document), and an associated document vector d j =(freq , freq ,..., freq mj ) T
2j
1j
is created in which
freq =number of times term T i occurs in document D j .
ij
(More sophisticated search engines use weighted frequency strategies.) After a
collection of documents D 1 ,D 2 ,...,D n has been indexed, the associated docu-
ment vectors d j are placed as columns in a term-by-document matrix
D 1 D 2 ··· D n
T 1 freq 11 freq 12 ··· freq 1n
T 2 freq freq ··· freq
21 22 2n
A m×n = d 1 | d 2 ··· | d n = . . . . .
. . . .
. . . .
T m freq m1 freq m2 ··· freq mn
Naturally, most entries in each document vector d j will be zero, so A is a
sparse matrix—this is good because it means that sparse matrix technology can
be applied. When a query composed of a few terms is submitted to the search
T
engine, a query vector q =(q 1 ,q 2 ,...,q n )is formed in which
5 1if term T i appears in the query,
q i =
0 otherwise.
(The q i ’s might also be weighted.) To measure how well a query q matches a
document D j , we check how close q is to d j by computing the magnitude of
T T
q d j q Ae j
cos θ j = = . (5.12.12)
q d j q Ae j
2 2 2 2
If | cos θ j |≥ τ for some threshold tolerance τ, then document D j is con-
sidered relevant and is returned to the user. Selecting τ is part art and part
science that’s based on experimentation and desired performance criteria. If the
columns of A along with q are initially normalized to have unit length, then