Auditory Scene Analysis: Understanding the Perception of Sound

Auditory Scene Analysis (ASA) describes how our perceptual system parses the complex vibration reaching the ears (sound) into a meaningful representation of the environment. It is a fundamental skill of the auditory system: it lets us perceive and identify individual sound events, such as a friend talking in a noisy restaurant, and the underlying neural processes must keep up with the dynamics of such situations.

For example, in the noisy scene of a city street, some of the sound components reaching your ears at any given moment may belong to a motorcycle driving by, others to ambient traffic noise, and still others to the voices of people on the sidewalk next to you: your auditory system deciphers which is which.

Parsing the scene involves grouping or separating sound events over time. Elements can be grouped together (integration), separated into concurrent layers (segregation), or divided into successive events (segmentation). The auditory system must also segment the incoming sound components into units that are delimited in time, such as musical notes, and decide which of them belong together in extended sequences such as melodies; this sequential organization is called auditory streaming.

Auditory Scene Analysis Explained

Although such a panoply of sounds might occur in the context of a piece of electroacoustic music, the experience of being bombarded with many different sounds is familiar from what the American psychologist William James called the 'blooming, buzzing confusion' of everyday life. How does the auditory system separate all of these sources into discrete perceptual units?

Principles Guiding Auditory Scene Analysis

Complicated though it may be, there are fortunately relatively few principles that guide the auditory system through this task. They are:

  • Harmonicity: Frequencies (or partials) related by simple integer ratios tend to group together. For example, if the auditory scene contains frequencies at 110 Hz, 220 Hz, and 330 Hz (n, 2n, 3n), the auditory system will tend to fuse them into a single complex sound, whereas frequencies at 110 Hz, 201 Hz, and 350 Hz, which are not related by simple ratios, are less likely to fuse (see the sketch after this list).
  • Amplitude comodulation: Sound components that get louder or softer in parallel tend to group together.
  • Source location: Sound components that originate from the same physical location in space tend to group together. Demonstrations that pan components between the left and right channels show how we tend to integrate or segregate auditory streams based on their perceived source location.
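
As a rough illustration of the harmonicity principle, the sketch below uses NumPy to synthesize two three-component complexes: one whose partials are exact integer multiples of 110 Hz, which listeners tend to hear as a single fused tone, and one with the unrelated partials mentioned above, which is more likely to break apart perceptually. The frequencies, helper names, and the optional playback via the third-party sounddevice package are illustrative assumptions, not stimuli from this article.

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz

def complex_tone(freqs_hz, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Sum equal-amplitude sinusoids at the given frequencies."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    tone = sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)
    return tone / len(freqs_hz)  # normalize so the sum cannot clip

# Harmonic complex (n, 2n, 3n): partials tend to fuse into one sound.
harmonic = complex_tone([110.0, 220.0, 330.0])

# Inharmonic complex: partials not related by simple ratios are
# more likely to be heard as separate components.
inharmonic = complex_tone([110.0, 201.0, 350.0])

# Optional playback (assumes the third-party `sounddevice` package):
# import sounddevice as sd
# sd.play(harmonic, SAMPLE_RATE); sd.wait()
# sd.play(inharmonic, SAMPLE_RATE); sd.wait()
```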

For the most part, we are unaware that this process is happening, and take it for granted. But before the auditory scene makes it into your conscious awareness, an amazing feat of pre-attentive analysis has already converted the dizzying complexity of air vibrations around you into a coherent picture of the world.

Auditory Stream Segregation

The interesting thing about this: imagine I asked you to come and join me at a cocktail party. You would hear people talking, glasses clinking, the wind blowing in the background, maybe music playing. All of the sounds in the environment enter your ears as a single mixture.

The brain then disentangles that mixture to provide neural representations that maintain the integrity of the distinct sources. This is the process of stream segregation: segregating the sources and maintaining their distinct representations. And this is what we are trying to understand: What does the brain do? What happens automatically?

Composers have known about this remarkable ability of the auditory system for hundreds of years. One of my favorite examples of auditory stream segregation is a piece of music with a single timbre, a single sound source: the guitar. This particular guitar piece uses a challenging technique called tremolo.

In tremolo, the guitarist rapidly repeats a single melodic note with three fingers of one hand (in the score the repetitions look rather like triplets), while the thumb plays a counter melody in the bass; it is a very demanding technique for the guitarist. What you experience, though, is two separate melodic streams. Listeners who hear the piece without seeing a single person playing the guitar often assume it is a duet for two guitars.

To take this into the laboratory, we play sounds sequentially and ask when listeners hear them as one integrated stream or as segregated streams. In our laboratory we use an example with tones of different frequencies played sequentially, in a sort of waltzing rhythm: bump bump bump, bump bump bump. We vary the frequency distance between the two sets of tones and ask how those tones are represented in the brain, as one stream or two, and how people perceive them.

So here, when the tones are integrated you hear this sort of waltzing tune. And if they are far apart in frequency, we may instead hear two distinct frequency streams, each with its own rhythm.
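
A minimal sketch of this kind of stimulus, assuming NumPy and standard audio conventions: it builds a repeating low-high-low "waltz" (an ABA_ triplet pattern) and lets you vary the frequency separation between the two tone sets. The specific frequencies, durations, and the aba_sequence helper are illustrative choices rather than the exact laboratory stimuli.

```python
import numpy as np

SAMPLE_RATE = 44100  # Hz

def pure_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

def aba_sequence(a_hz=440.0, delta_semitones=3, n_triplets=10,
                 tone_s=0.1, gap_s=0.1):
    """Build an ABA_ triplet sequence: a small frequency separation favors
    one integrated stream, a large separation favors two segregated streams."""
    b_hz = a_hz * 2 ** (delta_semitones / 12.0)   # B tone above A
    silence = np.zeros(int(gap_s * SAMPLE_RATE))
    triplet = np.concatenate([
        pure_tone(a_hz, tone_s), pure_tone(b_hz, tone_s),
        pure_tone(a_hz, tone_s), silence,          # the "_" pause
    ])
    return np.tile(triplet, n_triplets)

integrated_like = aba_sequence(delta_semitones=2)   # small separation
segregated_like = aba_sequence(delta_semitones=10)  # large separation
```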

Measuring Brain Processes in Auditory Scene Analysis

Event-related brain potentials (ERPs) are excellent tools for understanding how sound is indexed and held in memory. The way we get event-related brain potentials is by recording the EEG. We use an electrode cap, and here is one of our happy subjects.

The subject sits in a chair in a sound-attenuated booth, wearing the cap with the electrodes, hooked up to the amplifier, with insert earphones in their ears. We record their EEG and play sounds through the insert earphones, and the onset of each sound is time-stamped onto the EEG record.

We then segment out the stretches of EEG that begin a little before and end a little after each sound event and average them together. The raw brain activity is quite noisy: it contains all of the spontaneous neural activity, and we want the specific response to the events we are interested in. As we average these many trials together, what emerges is the time- and phase-locked ERP, a waveform with characteristic changes in polarity.
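
In outline, that averaging step could look like the NumPy sketch below: given a continuous single-channel EEG trace and the sample indices of the sound onsets, it cuts out a window around each event, baseline-corrects it, and averages across trials. The array names, sampling rate, and window limits are placeholders for whatever a real recording pipeline (for example, MNE-Python) would provide.

```python
import numpy as np

def average_erp(eeg, onset_samples, sfreq=500.0,
                tmin_s=-0.1, tmax_s=0.5):
    """Average event-locked EEG epochs from a single channel.

    eeg           : 1-D array, continuous EEG in microvolts
    onset_samples : sample indices where each sound started
    """
    pre = int(-tmin_s * sfreq)           # samples before the event
    post = int(tmax_s * sfreq)           # samples after the event
    epochs = []
    for onset in onset_samples:
        if onset - pre < 0 or onset + post > len(eeg):
            continue                     # skip events too close to the edges
        epoch = eeg[onset - pre:onset + post]
        epoch = epoch - epoch[:pre].mean()   # baseline-correct to pre-stimulus
        epochs.append(epoch)
    # Averaging cancels activity not time-locked to the events,
    # leaving the event-related potential (ERP).
    return np.mean(epochs, axis=0)
```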

What you are looking at here is a classic auditory evoked potential, an obligatory response just to sound onset. The Y-axis shows the amplitude of the response in microvolts, and the X-axis shows time. You can see that this is a very rapid response following sound onset.

Following from the baseline, there are three main components with alternating polarities: P1, the first positive peak, at around 50 milliseconds; N1, the first negative peak, at around 100 milliseconds; and P2, a second positive peak, at around 170 milliseconds.
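
Continuing the sketch above, the component peaks can be read off the averaged waveform by finding the largest positive or negative deflection inside conventional latency windows. The windows in the commented example (roughly 30-80 ms for P1, 80-150 ms for N1, and 140-220 ms for P2) are typical textbook ranges, not values taken from this article.

```python
import numpy as np

def peak_latency(erp, sfreq, t0_s, window_s, polarity=+1):
    """Return (latency_s, amplitude) of the extreme value in a window.

    erp      : averaged waveform whose first sample is at time t0_s
    window_s : (start, end) in seconds relative to sound onset
    polarity : +1 for positive peaks (P1, P2), -1 for negative (N1)
    """
    times = t0_s + np.arange(len(erp)) / sfreq
    mask = (times >= window_s[0]) & (times <= window_s[1])
    idx = np.argmax(polarity * erp[mask])
    return times[mask][idx], erp[mask][idx]

# Example use with the average_erp() output from the previous sketch:
# p1 = peak_latency(erp, 500.0, -0.1, (0.03, 0.08), +1)
# n1 = peak_latency(erp, 500.0, -0.1, (0.08, 0.15), -1)
# p2 = peak_latency(erp, 500.0, -0.1, (0.14, 0.22), +1)
```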

But when we want to know what the brain is doing with the sounds, we need a somewhat more complicated design, one that probes how a multitude of sounds is processed. The components described so far are the obligatory, exogenous, stimulus-driven response to sound onset; next we want to know what the brain does beyond that.

Mismatch Negativity (MMN)

So we present what we call an auditory oddball. One example of what we might do: we repeat a sound over and over and occasionally change its frequency (represented in pink here). We want to know whether the brain detected the difference between the repeating regularity, which we call the standard, and the occasional frequency change, which we call the deviant.

We average together the responses to all of the standard sounds (the black ones) and, separately, the responses to all of the randomly occurring, infrequent deviant sounds (shown here in pink). The negative displacement of the deviant response relative to the standard is what we call the mismatch negativity: it is the index that the brain detected that something different happened.

The way we visualize the MMN is by subtracting the standard ERP from the deviant ERP. The subtraction eliminates the obligatory components and leaves the change-detection response, the mismatch negativity component, whose peak latency marks when the change was detected.
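
Putting the pieces together, the MMN difference wave is simply the deviant average minus the standard average, and its most negative point within a typical 100-250 ms window gives the change-detection latency. As before, the helper names and the latency window are illustrative assumptions.

```python
import numpy as np

def mismatch_negativity(standard_erp, deviant_erp, sfreq, t0_s,
                        window_s=(0.10, 0.25)):
    """Compute the MMN difference wave and locate its negative peak."""
    diff = deviant_erp - standard_erp     # removes the obligatory components
    times = t0_s + np.arange(len(diff)) / sfreq
    mask = (times >= window_s[0]) & (times <= window_s[1])
    idx = np.argmin(diff[mask])           # most negative deflection
    return diff, times[mask][idx], diff[mask][idx]

# Example use with the earlier sketch:
# std_erp = average_erp(eeg, standard_onsets)
# dev_erp = average_erp(eeg, deviant_onsets)
# mmn_wave, mmn_latency, mmn_amplitude = mismatch_negativity(
#     std_erp, dev_erp, sfreq=500.0, t0_s=-0.1)
```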

The reason I am going into so much detail about the mismatch negativity component is that it is the measure I will use for all of the experiments I am going to talk about. The first thing about it is that it is modality-specific: it is generated within auditory cortices, so we know where we are in the brain.

The second thing is that we do not have to ask the listener what they heard in order to index sound change detection. We can present the sounds to the ears and measure what is going on in the background while the listener is doing something else in the scene, whether visual or auditory. In other words, they can be ignoring the sounds; they still have to hear the sounds, but they do not have to be doing a task with them for this change-detection response to occur. Or they can be actively listening and pressing a key for the deviants, and we also get the MMN.

The last thing, which is really crucial because we want to understand sound organization at a somewhat later level, how sounds are represented in memory, is that the component is strongly context-dependent. Context-dependent means that its elicitation is based not simply on the features of the sound, but on the memory of the history of the sounds that have been ongoing.

Here is the random oddball, where we present the deviant tone twenty percent of the time, randomly interspersed in a sequence with the other sound. As I just demonstrated, when we do that we get an MMN elicited by the deviant. The standard is the repeating regularity in time, location, intensity, and duration of the sound; the only thing that changed in the case I showed you was its frequency. We could change any feature, and the same thing would occur.

So we get an MMN based on the standard: the standard is the context, and the deviant is the change in frequency. But now take those same two tones with the exact same probabilities, except that we group the sequence so that the deviant, the pink frequency, occurs on every fifth tone. It occurs with the same probability, but now there is a sequence of five tones that repeats over and over.

Now the MMN is no longer elicited by the tones that were deviants before. The reason is that even though this tone is infrequent in the block, and it has a different frequency than the black tones represented here, the standard, if the brain detects it, is now the five-tone repeating regularity. And you do not get an MMN from a standard; you get an MMN from something that is detected as being different.
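
The contrast between the two designs can be made concrete with a small sequence generator: both sequences contain the same proportion of "pink" tones, but in one their positions are random, while in the other they fall on every fifth position, so the five-tone pattern itself becomes the regularity. The function below is only a sketch of the stimulus logic, not the exact experimental protocol.

```python
import random

def oddball_sequence(n_tones=500, deviant_prob=0.2, patterned=False):
    """Return a list of 's' (standard) and 'd' (deviant) labels.

    patterned=False : deviants occur randomly with probability deviant_prob
                      -> the repeating standard is the regularity, and
                         each deviant violates it (MMN expected).
    patterned=True  : a deviant occurs on every 5th tone -> the five-tone
                      cycle is the regularity, and the same "deviant"
                      no longer violates anything (no MMN expected).
    """
    if patterned:
        return ['d' if (i + 1) % 5 == 0 else 's' for i in range(n_tones)]
    return ['d' if random.random() < deviant_prob else 's'
            for _ in range(n_tones)]

random_block = oddball_sequence(patterned=False)
patterned_block = oddball_sequence(patterned=True)
```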

So the thing you should understand about MMN, something I did not know when I was a student doing my PhD and learned only later, though fairly early in my work on it, is that you do not elicit MMN simply by having an infrequent frequency or feature occur in a sequence. There has to be a detected standard that serves as the basis. Most MMN studies used the oddball paradigm, so the standard was just a repeating sound.

So when MMN is elicited you know that a change was detected and you can infer what the standard was. And that’s going to form the basis of all the experiments I’ll talk about.

Automatic Processes in Auditory Scene Analysis

What happens when you don’t have a task for the sounds? Are they organized or are they simply background noise?

To explain this, I need to go through a few processes, because a number of different things occur in this particular situation, not just stream segregation. First I will talk about a phenomenon that we discovered early on.

Here we have a somewhat modified oddball: every time a deviant occurs, another deviant follows it. The thing to notice is the very rapid pace, a stimulus onset asynchrony of a hundred and fifty milliseconds, which is within the temporal window of integration; elements that fall within this window tend to get integrated together. Keep that in mind, because things work differently at slower presentation rates.

So every time a deviant occurred, another one followed it, and what we found was that we got one MMN to those two successive events. Now, another way you might interpret this is to say that maybe the MMN generators are refractory and simply cannot put out two MMNs within a hundred and fifty milliseconds.

Well, we ran another condition in which we mixed up what the deviants were. Now, every time a deviant occurred, it was no longer fully predictable that another deviant would follow it: we had what we call single deviants and double deviants in the same block, at the same pace. And now we got two MMNs. The exact same physical input produced two MMNs in the mixed condition but only one MMN in the block condition.

What this indicates is that the second deviant carries new information, so it needs to be represented in a different way. One other thing to note: when the two deviants were integrated together, the peak latency of the MMN was longer than when they produced two separate MMNs. Those two MMNs were separated by a hundred and fifty milliseconds, and remember that timing with evoked potentials is very precise.
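
To make the block versus mixed manipulation concrete, here is a hypothetical generator for the two kinds of trial lists: in the block condition every deviant is immediately followed by a second deviant, while in the mixed condition single and double deviants are interleaved, so the second deviant is no longer predictable. The probabilities and list lengths are placeholders.

```python
import random

def deviant_sequence(n_events=400, deviant_prob=0.1, mixed=False):
    """Label each event 's' (standard) or 'd' (deviant).

    mixed=False (block): every deviant is followed by a second deviant,
                         so the pair is fully predictable -> one MMN.
    mixed=True         : single and double deviants occur unpredictably,
                         so the second deviant carries new information
                         -> two MMNs, about 150 ms apart at this pace.
    """
    seq = []
    while len(seq) < n_events:
        if random.random() < deviant_prob:
            if mixed and random.random() < 0.5:
                seq.append('d')            # single deviant
            else:
                seq.extend(['d', 'd'])     # double deviant
        else:
            seq.append('s')
    return seq[:n_events]

block_condition = deviant_sequence(mixed=False)
mixed_condition = deviant_sequence(mixed=True)
```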

Now we take that phenomenon, the double-deviant oddball, into a streaming paradigm. We have a set of low sounds at four hundred and forty hertz and a set of higher sounds far enough away in frequency that they should segregate automatically even when you are not paying attention to the sounds. The pace is now seventy-five milliseconds onset to onset, alternating low, high, low, high, so the hundred-and-fifty-millisecond spacing between the double deviants is maintained, the same as before. The only difference is that a high tone now actually occurs in the input to the ear between every pair of low tones.

What we wanted to know was this: if the sounds segregate automatically, so that a representation of the low stream and a representation of the high stream are held separately in memory, then the contextual effects should operate within the low stream, and we should again get one MMN in the block condition and two MMNs in the mixed condition.

That is what happened: we found one MMN in the block condition and two MMNs in the mixed condition, which to us indicates that within-stream event formation is based on the already segregated input.

So think of yourself back at the cocktail party we visited before: the first thing that happens when you enter the room is that all of the sounds are segregated based on the spectral and temporal characteristics of the input, and then the events within each stream are formed.

Let's put that into a schematic model of what would be happening. Look at all of the different processes occurring in a bottom-up fashion: the sound input comes in, it is parsed based on spectral and temporal cues, the streams are formed, and events are formed on the already segregated information. The standards are formed there, then deviance detection occurs, and then we get the MMN.
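
One way to make the schematic concrete is a toy pipeline in which each stage is a plain function: segregate the mixture by a spectral cue, form the regularity (standard) within each stream, and only then test incoming events against it. This is purely a conceptual sketch of the bottom-up ordering described above, not a model of any real neural computation.

```python
def segregate_streams(components):
    """Group sound components into streams by a spectral cue (here: pitch)."""
    streams = {}
    for c in components:
        key = 'low' if c['freq_hz'] < 1000 else 'high'
        streams.setdefault(key, []).append(c)
    return streams

def form_standard(stream):
    """The regularity within a stream: here, simply its most common frequency."""
    freqs = [c['freq_hz'] for c in stream]
    return max(set(freqs), key=freqs.count)

def detect_deviants(stream, standard_hz):
    """Deviance detection: events that violate the within-stream standard."""
    return [c for c in stream if c['freq_hz'] != standard_hz]

def auditory_scene_analysis(components):
    """Bottom-up ordering: segregate first, then form standards within each
    stream, then detect deviants (which would elicit an MMN)."""
    report = {}
    for name, stream in segregate_streams(components).items():
        standard = form_standard(stream)
        report[name] = detect_deviants(stream, standard)
    return report
```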

Attention plays an important role in how we perceive and understand the environment.