Page 212 - Foundations of Cognitive Psychology : Core Readings

The Auditory Scene  217

               tens it to the sides of the channel. As waves reach the side of the lake they
               travel up the channels and cause the two handkerchiefs to go into motion. You
               are allowed to look only at the handkerchiefs and from their motions to answer
               a series of questions: How many boats are there on the lake and where are
               they? Which is the most powerful one? Which one is closer? Is the wind blow-
               ing? Has any large object been dropped suddenly into the lake?
                 Solving this problem seems impossible, but it is a strict analogy to the prob-
               lem faced by our auditory systems. The lake represents the lake of air that sur-
               rounds us. The two channels are our two ear canals, and the handkerchiefs are
               our ear drums. The only information that the auditory system has available to
               it, or ever will have, is the vibrations of these two ear drums. Yet it seems to be
               able to answer questions very like the ones that were asked by the side of the
               lake: How many people are talking? Which one is louder, or closer? Is there a
               machine humming in the background? We are not surprised when our sense of
               hearing succeeds in answering these questions any more than we are when our
               eye, looking at the handkerchiefs, fails.
                 The difficulty in the examples of the lake, the infant, the sequence of letters,
               and the block drawings is that the evidence arising from each distinct physical
               cause in the environment is compounded with the effects of the other ones
               when it reaches the sense organ. If correct perceptual representations of the
               world are to be formed, the evidence must be partitioned appropriately.
                 In vision, you can describe the problem of scene analysis in terms of the
               correct grouping of regions. Most people know that the retina of the eye acts
               something like a sensitive photographic film and that it records, in the form of
               neural impulses, the “image” that has been written onto it by the light. This
               image has regions. Therefore, it is possible to imagine some process that groups
               them. But what about the sense of hearing? What are the basic parts that must
               be grouped to make a sound?
                 Rather than considering this question in terms of a direct discussion of the
               auditory system, it will be simpler to introduce the topic by looking at a spec-
               trogram, a widely used description of sound. Figure 9.3 shows one for the
               spoken word “shoe.” The picture is rather like a sheet of music. Time proceeds
               from left to right, and the vertical dimension represents the physical dimension
               of frequency, which corresponds to our impression of the highness of the sound.
               The sound of a voice is complex. At any moment of time, the spectrogram
               shows more than one frequency. It does so because any complex sound can
               actually be viewed as a set of simultaneous frequency components. A steady
               pure tone, which is much simpler than a voice, would simply be shown as a
               horizontal line because at any moment it would have only one frequency.
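The description above can be made concrete with a small sketch. It is only an illustration, not the procedure used to produce figure 9.3: the sample rate, frame length, hop size, test frequencies, and the helper names `spectrogram` and `count_components` are all assumptions chosen for this example. It cuts a sound into short overlapping frames, measures the strength of each frequency in each frame, and then checks how many strong frequency components appear at each moment.

```python
import numpy as np

SAMPLE_RATE = 8000   # samples per second (assumed for this sketch)
FRAME_LEN = 256      # samples per analysis frame (assumed)
HOP = 128            # samples between successive frame starts (assumed)

def spectrogram(signal):
    """Magnitude spectrogram: rows are frequencies, columns are moments in time."""
    window = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack([signal[i * HOP : i * HOP + FRAME_LEN] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def count_components(column, rel_threshold=0.5):
    """Count spectral peaks in one time slice that reach rel_threshold of the strongest."""
    inner = column[1:-1]
    is_peak = ((inner > column[:-2]) & (inner > column[2:])
               & (inner >= rel_threshold * column.max()))
    return int(np.count_nonzero(is_peak))

t = np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE            # half a second of time points
freqs = np.fft.rfftfreq(FRAME_LEN, d=1.0 / SAMPLE_RATE)  # frequency of each row, in Hz

# A steady pure tone, and a complex sound built from three simultaneous tones.
pure_tone = np.sin(2 * np.pi * 1000 * t)
complex_sound = sum(np.sin(2 * np.pi * f * t) for f in (500, 1000, 1500))

tone_spec = spectrogram(pure_tone)
complex_spec = spectrogram(complex_sound)

# Strongest frequency in each time slice of the pure tone.
tone_track = freqs[np.argmax(tone_spec, axis=0)]
# Number of strong components in each time slice.
tone_counts = [count_components(col) for col in tone_spec.T]
complex_counts = [count_components(col) for col in complex_spec.T]
```

Under these assumptions, `tone_track` stays at about 1000 Hz in every frame, which is the horizontal line described above, and `tone_counts` is 1 throughout. For the three-tone sound, `complex_counts` is 3 in every frame: at any moment the picture shows more than one frequency.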
                 Once we see that the sound can be made into a picture, we are tempted to
               believe that such a picture could be used by a computer to recognize speech
               sounds. Different classes of speech sounds, stop consonants such as “b” and
               fricatives such as “s” for example, have characteristically different appearances
               on the spectrogram. We ought to be able to equip the computer with a set of
               tests with which to examine such a picture and to determine whether the shape
               representing a particular speech sound is present in the image. This makes the
               problem sound much like the one faced by vision in recognizing the blocks in
               figure 9.2.