                                (Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 1986).
                                The early work in simultaneously recovering 3D structure and camera motion (see Chapter 7)
                                also began around this time (Ullman 1979; Longuet-Higgins 1981).
                                   A lot of the philosophy of how vision was believed to work at the time is summarized
                                in David Marr’s (1982) book.8 In particular, Marr introduced his notion of the three levels
                                of description of a (visual) information processing system. These three levels, very loosely
                                paraphrased according to my own interpretation, are:

                                   • Computational theory: What is the goal of the computation (task) and what are the
                                     constraints that are known or can be brought to bear on the problem?

                                   • Representations and algorithms: How are the input, output, and intermediate infor-
                                     mation represented and which algorithms are used to calculate the desired result?


                                   • Hardware implementation: How are the representations and algorithms mapped onto
                                     actual hardware, e.g., a biological vision system or a specialized piece of silicon? Con-
                                     versely, how can hardware constraints be used to guide the choice of representation
                                     and algorithm? With the increasing use of graphics chips (GPUs) and many-core ar-
                                     chitectures for computer vision (see Section C.2), this question is again becoming quite
                                     relevant.

                                As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the
                                problem specification and known constraints from image formation and priors (the scientific
                                and statistical approaches) must be married with efficient and robust algorithms (the engineer-
                                ing approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy
                                is as good a guide to framing and solving problems in our field today as it was 25 years ago.



                                1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical
                                techniques for performing quantitative image and scene analysis.
                                   Image pyramids (see Section 3.5) started being widely used to perform tasks such as im-
                                age blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt
                                and Adelson 1983a,b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions
                                of pyramids using the concept of scale-space processing were also developed (Witkin 1983;
                                Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see Sec-
                                tion 3.5.4) started displacing or augmenting regular image pyramids in some applications
                                (Adelson, Simoncelli, and Hingorani 1987; Mallat 1989; Simoncelli and Adelson 1990a,b;
                                Simoncelli, Freeman, Adelson et al. 1992).
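
                                    As a rough illustration of the pyramid idea mentioned above (not code from the book
                                 itself), the following minimal Python sketch builds a Gaussian pyramid by repeated
                                 blur-and-subsample and a simple Laplacian (band-pass) pyramid from it, the basic
                                 operations underlying pyramid blending and coarse-to-fine search; the kernel width,
                                 number of levels, and nearest-neighbor upsampling are illustrative assumptions.

                                    # Minimal sketch (illustrative only): Gaussian and Laplacian pyramids.
                                    import numpy as np
                                    from scipy.ndimage import gaussian_filter

                                    def gaussian_pyramid(image, levels=4, sigma=1.0):
                                        """List of images from full resolution down to the coarsest level."""
                                        pyramid = [image.astype(np.float32)]
                                        for _ in range(levels - 1):
                                            blurred = gaussian_filter(pyramid[-1], sigma)  # low-pass filter
                                            pyramid.append(blurred[::2, ::2])              # subsample by 2
                                        return pyramid

                                    def laplacian_pyramid(image, levels=4, sigma=1.0):
                                        """Band-pass pyramid: differences between successive Gaussian levels."""
                                        gp = gaussian_pyramid(image, levels, sigma)
                                        lp = []
                                        for fine, coarse in zip(gp[:-1], gp[1:]):
                                            # Upsample the coarse level (nearest-neighbor repeat, for simplicity)
                                            # back to the fine level's size and take the difference.
                                            up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
                                            lp.append(fine - up[:fine.shape[0], :fine.shape[1]])
                                        lp.append(gp[-1])  # keep the coarsest Gaussian level as the residual
                                        return lp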
                                   The use of stereo as a quantitative shape cue was extended by a wide variety of shape-
                                from-X techniques, including shape from shading (Figure 1.8b) (see Section 12.1.1 and Horn
                                1975; Pentland 1984; Blake, Zimmerman, and Knowles 1985; Horn and Brooks 1986, 1989),
                                photometric stereo (see Section 12.1.1 and Woodham 1981), shape from texture (see Sec-
                                tion 12.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from
                                focus (see Section 12.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice
                                discussion of most of these techniques.

                                  8  More recent developments in visual perception theory are covered in (Palmer 1999; Livingstone 2008).