Understanding Scene Analysis: The Role of Attention in Auditory Perception
Imagine yourself walking along a busy city street. Sounds from the environment come and go: a car horn honks; a jet flies overhead; a jackhammer pounds; people talk as they walk past you. The sound waves from all of these sources reach your ears as a single mixture, overlapping in time. Attention plays an important role in how we perceive and make sense of this environment.
This review article summarizes a talk given at the 2016 American Speech-Language-Hearing Association (ASHA) Research Symposium on the role of attention in auditory scene analysis in adults with normal hearing. It is not yet clear what happens automatically, through stimulus-driven processes, or how attention modifies neural activity to support scene analysis. We do not fully understand how the brain makes it possible to select and listen to one voice amid the din of environmental sounds.
The Challenge of Unattended Sounds
There has been much controversy regarding the degree to which unattended sensory input is processed. William James famously suggested that we can attend to only one thing at a time (James, 1890), a precursor to the limited-capacity models of attention advanced by 20th-century psychologists (Kahneman, 1973). When one of many competing sound sources is selected, what then is the fate of the unattended input? Broadbent (1958) originally proposed an early-selection filter, in which unattended inputs receive only limited processing (e.g., of the physical features of the sound). Others later proposed that all inputs are fully processed, but the information is forgotten if not used (Deutsch & Deutsch, 1963). Kahneman's limited-capacity model, in contrast, suggested that, because attention is a limited resource, the complexity of the input influences the degree to which unattended input is processed.
However, it is still not well understood when, where in the processing hierarchy, or how complexity affects the processing of unattended sensory inputs. One reason is that it is difficult to directly measure responses to sensory stimulation that is not being attended. Behavioral measures can be used to infer the fate of unattended input, for example by quantifying its influence on task performance, but they do not provide a direct measure of processing.
Event-Related Brain Potentials (ERPs) as a Tool
One of the challenges in understanding how sounds are processed in noisy environments with competing sound sources is quantifying how unattended sounds are represented in memory when attention is used to select a subset of the sensory input; it is therefore difficult to assess, behaviorally, to what degree unattended sounds are processed. Event-related brain potentials (ERPs) address this problem by providing a direct, quantifiable measure of brain activity to both attended and unattended stimuli while the stimuli are being presented.
ERPs, which are time-locked to specific stimulus events and extracted from the ongoing electroencephalography (EEG) record, provide a unique opportunity to observe brain responses to both attended and unattended information during selective listening. One particularly useful ERP component for assessing processes associated with auditory scene analysis is mismatch negativity (MMN).
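To make the averaging logic behind ERPs concrete, here is a minimal sketch in Python. It assumes a hypothetical single-channel EEG recording `eeg` (a 1-D array), its sampling rate `fs`, and a list of stimulus-onset sample indices `event_samples`; none of these names or parameter values come from the studies reviewed here.

```python
import numpy as np

def compute_erp(eeg, event_samples, fs, pre_s=0.1, post_s=0.5):
    """Average EEG epochs time-locked to stimulus onsets.

    Each epoch is baseline-corrected to the mean of its pre-stimulus
    interval; averaging across trials cancels activity that is not
    time-locked to the stimulus, leaving the event-related potential.
    """
    pre, post = int(pre_s * fs), int(post_s * fs)
    epochs = []
    for onset in event_samples:
        if onset - pre < 0 or onset + post > len(eeg):
            continue  # skip events too close to the edges of the recording
        epoch = eeg[onset - pre:onset + post]
        epochs.append(epoch - epoch[:pre].mean())  # baseline correction
    return np.mean(epochs, axis=0)  # (trials x time) -> averaged ERP
```

Because the same averaging applies to attended and unattended stimulus events alike, the technique yields a response measure even for sounds the listener is actively ignoring.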
Mismatch Negativity (MMN)
MMN is elicited when incoming sound violates a detected regularity (Näätänen, Gaillard, & Mäntysalo, 1978; Squires, Squires, & Hillyard, 1975). The repetition of a sound, or of a pattern of sounds, sets the basis for deviance detection: input that violates the repeated sound or pattern elicits an MMN. Sound change detection is therefore dependent upon the standard representation held in auditory memory (Sussman, 2007). That is, change detection is based on the organization of the sounds in the larger context and not simply on the individual features of the sounds (Alain, Achim, & Woods, 1999; Sussman & Gumenyuk, 2005; Sussman, Ritter, & Vaughan, 1998b, 1999; Sussman, Winkler, Huotilainen, Ritter, & Näätänen, 2002).
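As an illustration of how MMN is conventionally quantified (a sketch of the standard deviant-minus-standard logic, not the analysis code from the cited studies), the snippet below builds on the hypothetical `compute_erp` helper from the previous example:

```python
import numpy as np

def mmn_difference_wave(eeg, standard_onsets, deviant_onsets, fs,
                        pre_s=0.1, window_s=(0.1, 0.25)):
    """Deviant-minus-standard difference wave and its MMN peak.

    The MMN appears as a negative deflection in the difference wave,
    typically roughly 100-250 ms after deviance onset.
    """
    erp_std = compute_erp(eeg, standard_onsets, fs, pre_s=pre_s)
    erp_dev = compute_erp(eeg, deviant_onsets, fs, pre_s=pre_s)
    diff = erp_dev - erp_std

    # Most negative point in the analysis window; index 0 of the epoch
    # falls pre_s seconds before stimulus onset.
    start, stop = (int((pre_s + w) * fs) for w in window_s)
    peak = start + int(np.argmin(diff[start:stop]))
    return diff, diff[peak], peak / fs - pre_s  # wave, amplitude, latency (s)
```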
MMN as a tool for probing the sound trace has many advantages:
- It is modality specific: it is generated within auditory cortices (Alho, 1995; Giard, Perrin, Pernier, & Bouchet, 1990; Opitz, Mecklinger, Von Cramon, & Kruggel, 1999).
- It is elicited whether or not attention is focused on the sounds (Näätänen, Paavilainen, Tiitinen, Jiang, & Alho, 1993; Sussman, 2007; Sussman, Bregman, Wang, & Khan, 2005; Winkler, Czigler, Sussman, Horváth, & Balázs, 2005).
- It is an index of how sounds are held in memory (Javitt, Steinschneider, Schroeder, & Arezzo, 1996; Näätänen, Tervaniemi, Sussman, Paavilainen, & Winkler, 2001).
- It is distinguishable from non-modality-specific responses, such as those associated with attention and target detection (e.g., the P3b component; Novak, Ritter, Vaughan, & Wiznitzer, 1990; Sussman, 2007).
- It is highly context dependent (Sussman, 2007; Sussman, Chen, Sussman-Fort, & Dinces, 2014).
Stimulus-Driven Processes (Bottom-Up)
Auditory processes that are driven by the stimulus characteristics of the input, independent of attentional manipulation, are generally called stimulus-driven or "bottom-up" processes (Figure 1a). For example, when you first walk into a cocktail party, stimulus-driven processing is the processing that occurs before you have directed your attention to any one sound event: how the sound is represented in memory when you have no particular task involving the background din.
Sussman and colleagues demonstrated that, when the ears were presented with a mixture of sound frequencies that were irrelevant to the main task (e.g., when attention was focused on reading a book), the sounds were structured and organized into distinct frequency streams in auditory memory (Sussman, 2005; Sussman, Ritter, & Vaughan, 1999). Attention was not required to drive the initial segregation of sounds into streams. That is, stream segregation occurred automatically, based on the bottom-up spectrotemporal characteristics of the input.
These data support a hypothesis advanced by Bregman (1990) that stream segregation is a "primitive process" of audition, which predicts that within-stream events are formed after the initial segregation of the global sound mixture into streams. This prediction was tested by examining the timing of auditory event formation within a streaming paradigm (Sussman, 2005). In previous studies using a single stream (Sussman & Winkler, 2001; Sussman, Winkler, Kreuzer, et al., 2002; Sussman, Winkler, Ritter, Alho, & Näätänen, 1999), event formation was indexed by whether one or two MMNs were elicited by a "double-deviant" stimulus (i.e., two deviant stimuli presented successively).
The same double-deviant stimulus was presented in different sound contexts. The sound context influenced within-stream event formation, affecting whether one or two MMNs were elicited by the double deviants (Sussman & Winkler, 2001). This manipulation of contextual cues was then applied in a streaming paradigm in which the alternation of the tones would preclude the within-stream context effects unless the streams were segregated (see Figure 2). Only if the streams were neurophysiologically segregated would the within-stream context exert its influence and affect whether one or two MMNs were elicited by the double deviants.
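To illustrate the design (with invented parameters; the published studies specify their own probabilities and sequence constraints), the sketch below labels a tone sequence for one stream so that deviants occur occasionally and, some of the time, two deviants occur in immediate succession as a double deviant:

```python
import numpy as np

def label_tone_sequence(n_tones=400, deviant_prob=0.1,
                        double_prob=0.5, min_gap=3, seed=0):
    """Mark each tone as 'standard' or 'deviant', allowing double deviants.

    A run of standards re-establishes the regularity between deviant
    events, so each single or double deviant violates a freshly
    confirmed standard representation in auditory memory.
    """
    rng = np.random.default_rng(seed)
    labels = ["standard"] * n_tones
    i = min_gap
    while i < n_tones - 1:
        if rng.random() < deviant_prob:
            labels[i] = "deviant"
            if rng.random() < double_prob:
                labels[i + 1] = "deviant"  # double deviant: two in a row
                i += 1
            i += min_gap  # leave a run of standards after each event
        else:
            i += 1
    return labels
```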
Results demonstrated the same context effects as were found in the single-stream paradigm: one MMN elicited by the double deviants in the blocked condition and two MMNs in the mixed condition (see Figure 3). Thus, the sound context influenced within-stream event formation, indicating that stream segregation occurs first and that within-stream events are formed on the already segregated streams.
This automatic level of sound organization thus plays an important role in auditory scene analysis (see Figure 1a). The real-world implication is that, when you walk into a noisy room, sounds are sorted on the basis of the stimulus characteristics of the input and represented in memory as distinct sound streams. Stream segregation occurs first, and then sound events are detected and identified within the already sorted streams. Attention, a limited resource, can then be used to focus on and process the within-stream events of the already formed streams (e.g., to comprehend the speech stream). That is, attentional resources are conserved when some level of sorting occurs through automatic processes.
These results provide evidence for multiple stages of processing of unattended sounds: both the segregation of sounds into streams and the integration of within-stream events into perceptual units.

[Figure: Schematic model of attention effects on auditory scene analysis.]
The Role of Attention
In addition to stimulus-driven processes, attention is needed to refine scene analysis and highlight what we perceive. Attention interacts with passive processes and plays multiple roles in auditory scene analysis to facilitate task goals (Sussman, 2006). When you walk into a lively cocktail party, a level of sound organization occurs automatically: brain mechanisms disentangle the mixture of sound input entering the ears to form identifiable sound streams (e.g., a person talking, glasses clinking, music playing).
We found that attention interacts with stimulus-driven processes to sharpen stream segregation (Figure 1b). A recent study demonstrated that attention could effectively segregate sounds that were not segregated automatically when the same sounds were in the background and irrelevant to the task (Sussman & Steinschneider, 2009). Participants were presented with several conditions of alternating tones that differed in the frequency separation (Δf) between the lower frequency (440 Hz) and higher frequency tones. Two attention conditions were compared: active and passive.
In the active condition, the task was to listen to the lower frequency tones (440 Hz), ignore the higher frequency tones, and press a response key whenever a louder intensity tone occurred randomly among the lower frequency tones (Figure 4). When participants selected the low set of sounds to perform the task (active listening), they could segregate the sounds at a smaller frequency separation than occurred automatically during passive listening (see Figure 5, dashed circles). These results demonstrate that attention can refine stream segregation by modulating the stream formation process (see Figure 1b).
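For concreteness, here is a sketch of the kind of alternating two-tone sequence used in this type of paradigm. The 440 Hz low tone and the Δf manipulation come from the description above; every other value (tone and gap durations, target probability, intensity increment) is an illustrative assumption rather than the published stimulus parameters.

```python
import numpy as np

def alternating_tone_sequence(delta_f, n_pairs=100, fs=44100,
                              tone_s=0.05, gap_s=0.05,
                              target_prob=0.1, target_gain=2.0, seed=0):
    """Low (440 Hz) and high (440 + delta_f Hz) tones in alternation.

    A random subset of the low tones is boosted in intensity to serve
    as targets for the active listening task.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(tone_s * fs)) / fs
    low = np.sin(2 * np.pi * 440.0 * t)
    high = np.sin(2 * np.pi * (440.0 + delta_f) * t)
    gap = np.zeros(int(gap_s * fs))

    chunks, is_target = [], []
    for _ in range(n_pairs):
        target = rng.random() < target_prob
        is_target.append(target)
        chunks += [low * (target_gain if target else 1.0), gap, high, gap]
    return np.concatenate(chunks), is_target
```

Sweeping `delta_f` across conditions reproduces the key manipulation: at small separations the low and high tones tend to cohere into a single stream, whereas larger separations, or attention directed to the low tones, promote segregation.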

[Figure: Frequency separation as a cue for stream segregation.]

[Figure: Attention modulates stream segregation.]

[Figure: Context effects on event formation.]

[Figure: Segregation occurs before event formation.]
This article has presented a framework for understanding how attention interacts with stimulus-driven processes to facilitate task goals. Previously reported behavioral and electrophysiological data from adults with normal hearing were summarized to demonstrate attention effects on auditory perception, from passive processes that organize unattended input to attention effects that act at different levels of the system. A model of attention was provided that illustrates how the auditory system performs multilevel analyses involving interactions between stimulus-driven input and top-down processes.
Overall, these studies show that:
- Stream segregation occurs automatically and sets the basis for auditory event formation.
- Attention interacts with automatic processing to facilitate task goals.
- Information about unattended sounds is not lost when selecting one organization over another.