Understanding Scene Analysis: The Role of Attention in Auditory Perception
Attention plays a crucial role in how we perceive and understand the environment. Imagine walking down a bustling city street where a cacophony of sounds converges: car horns, airplanes, construction, and people talking. The sound waves reaching your ears are a mixture of overlapping sources. How does the brain manage to select and focus on a single voice amidst the environmental din?
This article explores the definition of scene analysis, particularly focusing on auditory scene analysis in adults with normal hearing, summarizing a talk given at the 2016 American Speech-Language-Hearing Association (ASHA) Research Symposium. It examines how attention interacts with stimulus-driven processes to facilitate task goals.

Figure: Diagram of the Auditory System
The Challenge of Unattended Sounds
It remains unclear what processes occur automatically through stimulus-driven mechanisms and how attention modifies neural activity to support scene analysis. William James suggested that we can only attend to one thing at a time, which led to the development of limited-capacity models of attention by 20th-century psychologists. When selecting one sound source from many, what happens to the unattended sounds?
There has been much debate about the extent to which unattended sensory input is processed. Broadbent initially proposed an early-selection filter, in which unattended inputs undergo only limited processing of basic sound features. Others suggested that all inputs are fully processed but forgotten if not used. Kahneman's limited-capacity model held that, because attention is a limited resource, the complexity of the input affects how much processing unattended input receives. However, when, where in the processing hierarchy, and how complexity affects the processing of unattended sensory input is still not well understood.
One reason for this lack of understanding is the difficulty of directly measuring responses to unattended sensory stimulation. Behavioral measures can infer the fate of unattended input by quantifying its influence on task performance, but they do not provide a direct measure of processing. Event-related brain potentials (ERPs) provide a direct, quantifiable measure of brain activity evoked by both attended and unattended stimuli as they are presented.
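To make the time-locking idea concrete, here is a minimal NumPy sketch of ERP extraction by epoch averaging. The function name, parameters, and single-channel input are illustrative assumptions, not the analysis pipeline of the studies discussed here.

```python
import numpy as np

def extract_erp(eeg, onsets, sfreq, tmin=-0.1, tmax=0.4):
    """Average EEG epochs time-locked to stimulus onsets (a sketch).

    eeg    : 1-D array, continuous signal from one electrode
    onsets : stimulus onset positions, in samples
    sfreq  : sampling rate in Hz
    tmin, tmax : epoch window in seconds relative to onset
    """
    pre, post = int(-tmin * sfreq), int(tmax * sfreq)
    epochs = []
    for onset in onsets:
        if onset - pre < 0 or onset + post > len(eeg):
            continue  # skip epochs that run off the recording
        epoch = eeg[onset - pre : onset + post].copy()
        epoch -= epoch[:pre].mean()  # baseline-correct on the pre-stimulus interval
        epochs.append(epoch)
    # averaging cancels activity that is not time-locked to the stimulus
    return np.mean(epochs, axis=0)
```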
Mismatch Negativity (MMN) as a Tool
One of the challenges in understanding sound processing in noisy environments with competing sources is quantifying how unattended sounds are represented in memory when attention is focused on a subset of sensory input. ERPs, time-locked to specific stimulus events and extracted from electroencephalography (EEG), offer a unique opportunity to observe brain responses to both attended and unattended information during selective listening.
A particularly useful ERP component for assessing auditory scene analysis is the mismatch negativity (MMN). The MMN is elicited when the brain detects a violation of a regularity in the sound input: the repetition of a sound or pattern establishes a standard, and input that violates this standard elicits an MMN. Sound change detection therefore depends on the standard representation held in auditory memory, and it is based on the organization of sounds in the larger context, not simply on individual features.
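As a concrete illustration, the MMN is commonly quantified as the deviant-minus-standard difference wave. The sketch below, with hypothetical parameter values and a typical 100-250 ms search window, shows one way this could be computed from averaged ERPs such as those returned by the sketch above.

```python
import numpy as np

def mmn_amplitude(erp_standard, erp_deviant, sfreq, tmin=-0.1,
                  window=(0.10, 0.25)):
    """Quantify the MMN as the deviant-minus-standard difference wave.

    Both ERPs are 1-D arrays on the same time base, starting at tmin
    seconds relative to stimulus onset.  The MMN is taken as the
    negative peak of the difference wave in the search window
    (typically ~100-250 ms after onset).
    """
    diff = erp_deviant - erp_standard
    start = int((window[0] - tmin) * sfreq)
    stop = int((window[1] - tmin) * sfreq)
    segment = diff[start:stop]
    peak_idx = np.argmin(segment)  # MMN is a negativity, so take the minimum
    peak_time = tmin + (start + peak_idx) / sfreq
    return diff, segment[peak_idx], peak_time
```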
MMN offers several advantages as a tool for probing the sound trace:
- It is modality-specific, generated within auditory cortices.
- It is elicited regardless of whether attention is focused on the sounds.
- It is an index of how sounds are held in memory.
- It is distinguishable from non-modality-specific responses associated with attention and target detection.
- It is highly context-dependent.

Figure: Schematic Model of Attention Effects on Auditory Scene Analysis
Stimulus-Driven Processes and Stream Segregation
Auditory processes driven by the stimulus characteristics of the input, independent of attentional manipulation, are called stimulus-driven or "bottom-up" processes. For example, when you enter a cocktail party, initial processing occurs before you direct attention to any specific sound event; this processing determines how sounds are represented in memory amid the background noise.
Sussman and colleagues demonstrated that when the ears are presented with a mixture of sound frequencies irrelevant to the main task (e.g., reading a book), the sounds are structured and organized into distinct frequency streams in auditory memory. Attention was not required to drive the initial segregation of sounds into streams; stream segregation occurred automatically, based on the bottom-up spectrotemporal characteristics of the input. These data support Bregman's hypothesis that stream segregation is a "primitive process" of audition, which predicts that within-stream events form after the initial segregation of the global sound mixture into streams.
This prediction was tested by examining the timing of auditory event formation within a streaming paradigm. In a previous study using a single stream, event formation was indexed by whether one or two MMNs were elicited by a "double deviant" (two deviant stimuli presented in succession). The same double-deviant stimulus was presented in different sound contexts, and the context influenced within-stream event formation, determining whether one or two MMNs were elicited by the successive deviants. This manipulation of contextual cues was then applied to a streaming paradigm in which the alternation of the tones would preclude the within-stream context effects unless the streams were segregated: only if the streams were neurophysiologically segregated would the within-stream context influence whether one or two MMNs were elicited by the double deviants.
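To make the paradigm concrete, the following sketch generates an alternating two-stream tone sequence with occasional double deviants embedded in the low stream. All parameter values (tone count, deviant probability, deviant size) are illustrative assumptions, not the parameters of the original studies.

```python
import numpy as np

rng = np.random.default_rng(0)

def streaming_sequence(n_pairs=400, f_low=440.0, delta_semitones=9,
                       p_deviant=0.1):
    """Alternating low/high tone sequence with within-stream double
    deviants in the low stream (illustrative parameters only).

    Returns a list of (frequency_hz, is_deviant) tuples.
    """
    f_high = f_low * 2 ** (delta_semitones / 12)
    seq = []
    i = 0
    while i < n_pairs:
        if rng.random() < p_deviant and i + 1 < n_pairs:
            # two successive deviant low-stream tones (here a 6%
            # frequency increment, a hypothetical deviance), each
            # followed by a standard high-stream tone
            for _ in range(2):
                seq += [(f_low * 1.06, True), (f_high, False)]
            i += 2
        else:
            seq += [(f_low, False), (f_high, False)]
            i += 1
    return seq
```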
Results demonstrated the same context effects found in the single-stream paradigm: one MMN elicited by the double deviants in the blocked condition and two MMNs in the mixed condition. Thus, the sound context influenced within-stream event formation, indicating that stream segregation occurs first and that within-stream events are formed on the already segregated streams. This automatic level of sound organization plays an important role in auditory scene analysis.
The real-world implication is that when you enter a noisy room, sounds are sorted based on their stimulus characteristics and represented in memory as distinct sound streams. Stream segregation occurs first, and sound events are then detected and identified on the already sorted streams. Attention, a limited resource, can then focus on and process the within-stream events of the already formed streams (e.g., to comprehend a speech stream). Attentional resources are conserved when some of this sorting occurs automatically. These results provide evidence for multiple stages of processing of unattended sounds: both the segregation of sounds into streams and the integration of within-stream events into perceptual units.

Figure: Frequency Separation as a Cue for Stream Segregation
The Role of Attention in Refining Scene Analysis
In addition to stimulus-driven processes, attention is needed to refine scene analysis and shape what we perceive. Attention interacts with passive processes and plays multiple roles in auditory scene analysis to facilitate task goals. When you walk into a lively cocktail party, a level of sound organization occurs automatically: brain mechanisms disentangle the mixture of sound input entering the ears to form identifiable sound streams (e.g., a person talking, glasses clinking, music playing).
Attention interacts with these stimulus-driven processes to sharpen stream segregation. A recent study demonstrated that attention could segregate sounds that were not segregated automatically when the same sounds were in the background and irrelevant to the task. Participants were presented with several conditions of alternating tones that differed in the frequency separation (Δf) between the lower frequency tones (440 Hz) and the higher frequency tones. Two attention conditions were compared: active and passive. In the active condition, the task was to listen to the lower frequency tones (440 Hz), ignore the higher frequency tones, and press a response key whenever a louder tone occurred randomly among the lower frequency tones.
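As an illustration of the Δf manipulation, the sketch below builds one stimulus condition: alternating low and high tones whose separation is set in semitones, with occasional louder low-stream targets for the active task. All values (target probability, gain, tone count) are hypothetical, not the parameters of the study described.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_condition(delta_f_semitones, n_tones=600, f_low=440.0,
                   p_target=0.08, target_gain_db=6.0):
    """One Δf condition: alternating low/high tones with louder
    low-stream targets (illustrative parameters only).

    Returns a list of (frequency_hz, amplitude_gain, is_target) tuples.
    """
    f_high = f_low * 2 ** (delta_f_semitones / 12)
    tones = []
    for i in range(n_tones):
        if i % 2 == 0:  # low-stream position
            is_target = rng.random() < p_target
            gain = 10 ** (target_gain_db / 20) if is_target else 1.0
            tones.append((f_low, gain, is_target))
        else:           # high-stream position
            tones.append((f_high, 1.0, False))
    return tones

# e.g., compare small vs. large frequency separations (in semitones)
conditions = {df: make_condition(df) for df in (2, 5, 11)}
```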
When participants selected the low set of sounds to perform the task (active listening), they could segregate the sounds at a smaller frequency separation than occurred automatically during passive listening. These results demonstrate that attention can refine auditory scene analysis by modulating the stream formation process.