Page 232 - Matrix Analysis & Applied Linear Algebra
4.6 Classical Least Squares
Applying these rules to the function in (4.6.3) produces
$$\frac{\partial f}{\partial x_i} = \left(\frac{\partial x}{\partial x_i}\right)^T A^T A x + x^T A^T A \,\frac{\partial x}{\partial x_i} - 2\left(\frac{\partial x}{\partial x_i}\right)^T A^T b.$$
Since $\partial x/\partial x_i = e_i$ (the $i^{th}$ unit vector), we have
$$\frac{\partial f}{\partial x_i} = e_i^T A^T A x + x^T A^T A e_i - 2e_i^T A^T b = 2e_i^T A^T A x - 2e_i^T A^T b.$$
Using $e_i^T A^T = [A^T]_{i*}$ and setting $\partial f/\partial x_i = 0$ produces the $n$ equations
$$[A^T]_{i*}\, A x = [A^T]_{i*}\, b \quad \text{for } i = 1, 2, \ldots, n,$$
which can be written as the single matrix equation $A^T A x = A^T b$. Calculus
guarantees that the minimum value of $f$ occurs at some solution of this system.
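The derivative computation above can be checked numerically. The following sketch (in Python/NumPy, with a small hypothetical $A$ and $b$ chosen for illustration) compares the analytic gradient $2A^T A x - 2A^T b$ of $f(x) = (Ax - b)^T(Ax - b)$ against a finite-difference approximation:

```python
import numpy as np

# Hypothetical data: any A and b of compatible sizes will do.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 0.0, 2.0])

def f(x):
    # f(x) = (Ax - b)^T (Ax - b), the least squares objective
    r = A @ x - b
    return r @ r

x = np.array([0.7, -1.3])

# Analytic gradient from the derivation: 2 A^T A x - 2 A^T b
grad_analytic = 2 * A.T @ A @ x - 2 * A.T @ b

# Central finite-difference gradient, one component per unit vector e_i
h = 1e-6
grad_numeric = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(2)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```

Because $f$ is quadratic, the central difference is exact up to floating-point rounding, so the two gradients agree to high accuracy.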
But this is not enough: we want to know that every solution of $A^T A x = A^T b$
is a least squares solution. So we must show that the function $f$ in (4.6.3) attains
its minimum value at each solution to $A^T A x = A^T b$. Observe that if $z$ is a
solution to the normal equations, then $f(z) = b^T b - z^T A^T b$. For any other
$y \in \Re^{n \times 1}$, let $u = y - z$, so $y = z + u$, and observe that
$$f(y) = f(z) + v^T v, \quad \text{where } v = Au.$$
Since $v^T v = \sum_i v_i^2 \ge 0$, it follows that $f(z) \le f(y)$ for all $y \in \Re^{n \times 1}$, and
thus $f$ attains its minimum value at each solution of the normal equations. The
remaining statements in the theorem follow from the properties established on
p. 213.
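The identity $f(y) = f(z) + v^T v$ can also be observed numerically. In this sketch (hypothetical random $A$ and $b$, chosen only for illustration), $z$ is obtained by solving the normal equations, and $f$ at perturbed points is compared against $f(z)$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # hypothetical 6x3 data matrix (full rank)
b = rng.standard_normal(6)

# Solve the normal equations A^T A z = A^T b
z = np.linalg.solve(A.T @ A, A.T @ b)

def f(x):
    r = A @ x - b
    return r @ r

# For any y, f(y) = f(z) + v^T v with v = A(y - z), hence f(z) <= f(y).
for _ in range(5):
    y = z + rng.standard_normal(3)
    v = A @ (y - z)
    assert np.isclose(f(y), f(z) + v @ v)
    assert f(z) <= f(y)
print("minimum value f(z) =", f(z))
```

Every random perturbation of $z$ increases $f$ by exactly $v^T v \ge 0$, confirming that $z$ is a minimizer.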
The classical least squares problem discussed at the beginning of this sec-
tion and illustrated in Example 4.6.1 is part of a broader topic known as linear
regression, which is the study of situations where attempts are made to express
one variable $y$ as a linear combination of other variables $t_1, t_2, \ldots, t_n$. In prac-
tice, hypothesizing that $y$ is linearly related to $t_1, t_2, \ldots, t_n$ means that one
assumes the existence of a set of constants $\{\alpha_0, \alpha_1, \ldots, \alpha_n\}$ (called parameters)
such that
$$y = \alpha_0 + \alpha_1 t_1 + \alpha_2 t_2 + \cdots + \alpha_n t_n + \varepsilon,$$
where ε is a “random function” whose values “average out” to zero in some
sense. Practical problems almost always involve more variables than we wish to
consider, but it is frequently fair to assume that the effect of variables of lesser
significance will indeed “average out” to zero. The random function ε accounts
for this assumption. In other words, a linear hypothesis is the supposition that
the expected (or mean) value of y at each point where the phenomenon can be
observed is given by a linear equation
$$E(y) = \alpha_0 + \alpha_1 t_1 + \alpha_2 t_2 + \cdots + \alpha_n t_n.$$
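A minimal regression sketch ties this back to the normal equations. Here the data are hypothetical (one explanatory variable $t$, observations generated from $y = 2 + 3t$ plus small noise), and the parameters $\alpha_0, \alpha_1$ are estimated by solving $A^T A \alpha = A^T y$:

```python
import numpy as np

# Hypothetical observations from y = 2 + 3t + noise
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * t + 0.05 * rng.standard_normal(50)

# Design matrix: a column of ones (for the intercept alpha_0) next to t
A = np.column_stack([np.ones_like(t), t])

# Normal equations: A^T A alpha = A^T y
alpha = np.linalg.solve(A.T @ A, A.T @ y)
print(alpha)  # estimates close to [2, 3]
```

In practice, `np.linalg.lstsq` (or a QR factorization) is preferred over forming $A^T A$ explicitly, since squaring the matrix squares its condition number; the direct solve above is used only to mirror the derivation.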