
Auditory Scene Analysis: The Perceptual Organization of Sound

Auditory Scene Analysis (ASA) delves into how our auditory systems construct a coherent representation of the acoustic environment from a complex mixture of sounds. This field, significantly advanced by Albert S. Bregman, explores the processes by which we segregate and group auditory elements to perceive distinct sound sources and events.

[Figure: diagram illustrating auditory scene analysis]

The Foundations of Auditory Scene Analysis

In the late 1960s, Albert Bregman began researching the perceptual organization of sound, initially believing the topic to be well-explored. However, he discovered a relative lack of attention to audition compared to vision. This led him to explore auditory stream segregation, later termed "streaming" by Ulric Neisser. Bregman's willingness to explore uncharted territory, influenced by his teacher Neil Miller, led to significant advancements in the field.

Bregman's work culminated in the development of a comprehensive framework for understanding how we parse and interpret the auditory world. For years Bregman thought of writing a book, and it was John Macnamara, a colleague at McGill, who convinced him to actually do it. Fortunately, Bregman was awarded a two-year research fellowship by the Killam Foundation to do so, and the publishing arrangement was soon concluded with The MIT Press.

Key Concepts and Phenomena

ASA addresses the challenge of how our auditory systems build a picture of the world through their sensitivity to sound. While it is not entirely true that textbooks ignore complex perceptual phenomena in audition, Bregman's purpose in his book was to see such phenomena as oblique glimpses of a general auditory process of organization that has evolved, in our auditory systems, to solve a problem he called "auditory scene analysis."

Auditory Stream Segregation

As Bregman recounts, auditory stream segregation got in the way of a study he was trying to do on auditory learning, and he decided to follow Miller's advice and pursue it. What he thought of as a detour at the time ended up occupying about twenty years, during which a body of research accumulated, both in his laboratory and elsewhere, and he developed a way of looking at it.

Auditory stream segregation (or streaming) refers to the perceptual separation of a complex sound into distinct auditory streams or sequences. This phenomenon demonstrates how our auditory system organizes sounds based on features like frequency, timing, and location.
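As a rough illustration (not a model of the auditory system), frequency-based grouping can be sketched in a few lines of code. The greedy rule and the semitone threshold below are illustrative assumptions, loosely inspired by the finding that tones far apart in frequency tend to split into separate perceptual streams:

```python
import math

def segregate_streams(freqs_hz, threshold_semitones=4.0):
    """Toy model: assign each tone to the nearest existing stream whose
    last frequency lies within `threshold_semitones`; otherwise start a
    new stream. Illustrative only, not a model of human audition."""
    streams = []  # each stream is a list of frequencies
    for f in freqs_hz:
        best = None
        for s in streams:
            dist = abs(12 * math.log2(f / s[-1]))  # distance in semitones
            if dist <= threshold_semitones and (best is None or dist < best[0]):
                best = (dist, s)
        if best is None:
            streams.append([f])
        else:
            best[1].append(f)
    return streams

# Alternating high/low tones, as in the classic streaming demonstrations
tones = [400, 800, 400, 800, 400, 800]
print(segregate_streams(tones))  # two streams: [[400, 400, 400], [800, 800, 800]]
```

With a large frequency separation the alternating sequence splits into a high stream and a low stream; tones close in frequency stay in a single stream, mirroring the perceptual effect in toy form.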

Timbre Constancy

Timbre constancy is another example of a complex auditory phenomenon. A friend's voice has the same perceived timbre in a quiet room as at a cocktail party. Yet at the party, the set of frequency components arising from that voice is mixed at the listener's ear with frequency components from other sources. The total spectrum of energy that reaches the ear may be quite different in different environments. To recognize the unique timbre of the voice we have to isolate the frequency components that are responsible for it from others that are present at the same time. A wrong choice of frequency components would change the perceived timbre of the voice. Just as in the case of the visual constancies, timbre constancy will have to be explained in terms of a complicated analysis by the brain, and not merely in terms of a simple registration of the input by the brain.

There are some practical reasons for trying to understand this constancy. There are engineers currently trying to design computers that can understand what a person is saying. However, in a noisy environment the speaker's voice comes mixed with other sounds. To a naive computer, each different sound that the voice comes mixed with makes it sound as if different words were being spoken or as if they were spoken by a different person. The machine cannot correct for the particular listening conditions as a human can. If the study of human audition were able to lay bare the principles that govern the human skill, there is some hope that a computer could be designed to mimic it.
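The point about mixtures can be sketched numerically. In the toy model below, each source is a hypothetical table of frequency components (the partial frequencies and amplitudes are invented for illustration); the ear receives only their unlabeled sum, so the same voice produces a different raw input in every environment:

```python
def mixture_spectrum(*sources):
    """Sum the component amplitudes of several sources (freq -> amplitude).
    The listener receives only this mixture, with no labels saying which
    component came from which source."""
    mix = {}
    for src in sources:
        for f, a in src.items():
            mix[f] = mix.get(f, 0.0) + a
    return mix

# Hypothetical partials, for illustration only
voice = {200: 1.0, 400: 0.5, 600: 0.25}
fan   = {120: 0.8, 240: 0.4}
music = {330: 0.9, 660: 0.45}

quiet  = mixture_spectrum(voice)
party1 = mixture_spectrum(voice, fan)
party2 = mixture_spectrum(voice, music)
# The raw input differs across environments even though the voice is the same:
print(sorted(party1) != sorted(party2))  # True
```

A "naive computer" operating on the raw mixture would treat `party1` and `party2` as different voices; timbre constancy requires first isolating the components belonging to `voice` from each mixture.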

The Problem of Scene Analysis

Let me clarify what I mean by auditory scene analysis. The best way to begin is to ask ourselves what perception is for. Since Aristotle, many philosophers and psychologists have believed that perception is the process of using the information provided by our senses to form mental representations of the world around us. In using the word representations, we are implying the existence of a two-part system: one part forms the representations and another uses them to do such things as calculate appropriate plans and actions. The job of perception, then, is to take the sensory input and to derive a useful representation of reality from it. An important part of building a representation is to decide which parts of the sensory stimulation are telling us about the same environmental object or event. Unless we put the right combination of sensory evidence together, we will not be able to recognize what is going on.


Demonstrations of Auditory Scene Analysis

To illustrate the principles of ASA, Albert S. Bregman and Pierre Ahad created a compact disc of audio demonstrations, packaged with a booklet explaining each one:

Bregman, A. S., & Ahad, P. (1996). Demonstrations of Auditory Scene Analysis: The Perceptual Organization of Sound [audio CD]. Auditory Perception Laboratory, McGill University.

The CD, in 16-bit PCM audio, may be ordered from The MIT Press.

These demonstrations provide concrete examples of how different auditory cues influence our perception of sound scenes. The CD includes examples of auditory stream segregation, demonstrating how sequences of tones can be perceived as separate streams based on their frequency or timing relationships.

The Role of Vision in Auditory Perception

A simple example is shown in figure 1.1. In the top line, the component messages are interleaved with no visual cues to distinguish them, and the two cannot be separated. However, if, as in the lower line of the figure, the component messages are segregated by visual factors, the meaning becomes apparent. (From Bregman 1981b.)

This business of separating evidence has been faced in the design of computer systems for recognizing the objects in natural scenes or in drawings. Figure 1.2 shows a line drawing of some blocks. We can imagine that the picture has been translated into a pattern in the memory of the computer by some process that need not concern us. We might think that once it was entered, all we would have to do to enable the computer to decide which objects were present in the scene would be to supply it with a description of the shape of each possible one. But the problem is not that easy. Before the machine could make any decision, it would have to be able to tell which parts of the picture represented parts of the same object. To our human eyes it appears that the regions labeled A and B are parts of a single block. This is not immediately obvious to a computer. In simple line drawings there is a rule that states that any white area totally surrounded by lines must depict a single surface. This rule implies that in figure 1.2 the whole of region A is part of a single surface. The reason for grouping region A with B is much more complex; the question of how it can be done can be set aside for the moment. The point of the example is that unless regions A and B are indeed considered part of a single object, the description that the computer constructs will not be correct, and the elongated shape formed out of A, B, and other regions will not be seen.

It seems as though a preliminary step along the road to recognition would be to program the computer to do the equivalent of taking a set of crayons and coloring in, with the same color, all those regions that were parts of the same block. Then some subsequent recognition process could simply try to form a description of a single shape from each set in which the regions were the same color. This allo...
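This "crayon coloring" step corresponds to what computer vision calls connected-component labeling. The sketch below is a minimal flood-fill version on a toy character grid (the grid and labels are invented for illustration), assigning one integer "color" to every connected region of identical cells:

```python
def color_regions(grid):
    """Flood-fill labeling: give every 4-connected group of equal cells
    the same 'crayon color' (an integer label)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[None] * cols for _ in range(rows)]
    color = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] is None:
                # Start a new region and flood-fill it with this color
                labels[r][c] = color
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and labels[ny][nx] is None
                                and grid[ny][nx] == grid[y][x]):
                            labels[ny][nx] = color
                            stack.append((ny, nx))
                color += 1
    return labels

# Two 'A' blocks separated by background '.'
scene = ["AA.",
         "AA.",
         "..A"]
print(color_regions(scene))  # [[0, 0, 1], [0, 0, 1], [2, 2, 3]]
```

Note that the two `A` regions receive different colors because they are not connected, which is exactly the grouping decision the recognizer needs before it can describe shapes.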

The Significance of ASA

The study of ASA has important implications for various fields, including:

  • Speech Recognition: Improving the ability of computers to understand speech in noisy environments.
  • Music Perception: Understanding how we perceive and appreciate music.
  • Hearing Aid Design: Developing hearing aids that can better separate and amplify relevant sounds.

ASA helps us understand how humans perceive and organize concurrent auditory stimuli. It highlights the importance of sequential streaming and auditory object formation, while also showing why qualitative accounts of perceptual organization need quantitative, experimental backing.

The earlier development of sophisticated thinking in the field of visual perception may also have been due to the fact that it was much easier to create a visual display with exactly specified properties than it was to shape sound in equally exact ways. If so, the present-day development of the computer analysis and synthesis of sound ought to greatly accelerate the study of auditory perception.

By a perceptual question I mean one that asks how our auditory systems could build a picture of the world around us through their sensitivity to sound, whereas by an ecological one I am referring to one that asks how our environment tends to create and shape the sound around us. (The two kinds of questions are related. Only by being aware of how the sound is created and shaped in the world can we know how to use it to derive the properties of the sound-producing events around us.) A typical textbook on audition would not take up such questions. Instead, you would find discussions of such basic auditory qualities as loudness and pitch. For each of these, the textbook might discuss the psychophysical question: which physical property of the sound gives rise to the perceptual quality that we experience? It might also consider the question of how the physiology of the ear and nervous system could respond to those properties of sound.

The most perceptual of the topics that you might encounter would be concerned with how the sense of hearing can tell the listener where sounds are coming from. Under this heading, some consideration would be given to the role of audition in telling us about the world around us. For the most part, instead of arising from everyday life, the motivation of much of the research on audition seems to have its origins in the medical study of deafness, where the major concerns are the sensitivity of the auditory system to weak sounds, the growth in perceived intensity with increases in the energy of the signal, and the effects of exposure to noise.

The situation would be quite different in the treatment of vision. It is true that you would see a treatment of psychophysics and physiology, and indeed there would be some consideration of such deficits as colorblindness, but this would not be the whole story. There would, for example, be a description of size constancy, the fact that we tend to see the size of an object as unchanged when it is at a different distance, despite the fact that the image that it projects on our retinas shrinks as it moves further away. Apparently some complex analysis by the brain takes into account clues other than retinal size in arriving at the perceived size of an object.

Why should there be such a difference? A proponent of the "great man" theory of history might argue that it was because the fathers of Gestalt psychology, who opened up the whole question of perceptual organization, had focused on vision and never quite got around to audition. However, it is more likely that there is a deeper reason. We came to know about the puzzles of visual perception through the arts of drawing and painting. The desire for accurate portrayal led to an understanding of the cues for distance and certain facts about projective geometry. This was accompanied by the development of the physical analysis of projected images, and eventually the invention of the camera. Early on, the psychologist was faced with the discrepancy between what was on the photograph or canvas and what the person saw.

Of course there is another possibility that explains the slighting of audition in the textbook: Perhaps audition is really a much simpler sense and there are no important perceptual phenomena like the visual constancies to be discovered. This is a notion that can be rejected. We can show that such complex phenomena as constancies exist in hearing, too. One example is timbre constancy.


ASA Contributors

The ideas and findings discussed are the product of the cumulative work of many individuals. I have reworked these ideas and made up a slightly different story about them that makes sense to me, but it is clear that an entire research community has labored to gain an understanding of these problems for a good many years. I want to particularly acknowledge the stimulation that I have received from the research work and theoretical writing of Christopher J. Darwin, Diana Deutsch, W. Jay Dowling, Stephen Handel, Hermann von Helmholtz, Ira J. Hirsh, Mari R. Jones, Bela Julesz, George A. Miller, Brian C. J. Moore, Otto Ortmann, Irvin Rock, Richard M. Warren, Leo van Noorden, and Giovanni Vicario.

The work in my own laboratory has been advanced by the contributions of many students, assistants, and associates. It would be impossible to mention all of them, but I would like to mention the following with particular appreciation: Pierre Abdel Ahad, Jack Abramson, André Achim, Gary Bernstein, Jock Campbell, Valter Ciocca, Gary Dannenbring, Peter Doehring, Magda Chalikia, Lynn Halpern, Robert Levitan, Christine Liao, Stephen McAdams, Michael Mills, Steven Pinker, Brian Roberts, Wendy Rogers, Alexander Rudnicky, Howard Steiger, Jack Torobin, Yves Tougas, Tony Wolff, and James Wright.

I want to thank John Chowning for inviting me to the Center for Computer Research in Music and Acoustics to spend the summer of 1982 and a sabbatical year in 1986 and 1987. These pleasant and productive periods gave me a chance to become familiar with what the computer music community, especially John Pierce, Max Mathews, and John Chowning, had discovered about the perception of musical sound. I have also benefited from valuable discussions with other colleagues. These include Pierre Divenyi, Bernard Mont-Reynaud, Earl Schubert, William Schottstaedt, and Mitchell Weintraub. In addition, Alan Belkin, Valter Ciocca, Michael Cohen, Doug Kieslar, John Pierce, Martin Tenenbaum, and Meg Withgott were kind enough to read parts of the manuscript and give me their comments.

I am particularly indebted to the Killam Foundation for its two-year fellowship. I should also mention the Department of Psychology of McGill University, which has been a congenial place to work. Finally, it is impossible to express the debt that I owe to my wife, Abigail Elizabeth Sibley. She has put up with me for many years and, although a historian by trade, has entered into my professional milieu with gusto, earning the affection and respect of my colleagues.