
Understanding Sound Localization: Principles, Techniques, and Applications

As the name suggests, sound source localization means determining where a sound of interest originates.

Sound source localization can be broken down further depending on the environment in which the sound originates. Imagine someone clapping their hands in an underground garage. After the clapping stops, reflections of the sound waves linger in the space for a short period. These acoustic reflections are part of a reverberant environment, in which the reflections interfere with the direct sound arriving at the listener's ears and distort the spatial cues used for localization. The ear perceives the sound to be farther away or closer than it actually is, which adds another layer to the problem. Although humans can quickly localize sound sources under moderate reverberation, localization accuracy degrades as the environment becomes more reverberant. With this in mind, a technological breakthrough to solve this problem is sorely needed.

When we hear a sound, how do we know where it's coming from? How do we know whether it's near to us or far from us? How do we know in which direction the sound source is? And how do we get a sense, solely from sound, of the kind of space we're in? Even when we listen to recorded or synthesized sound, and know that the sound is coming from loudspeakers or headphones, we still can have a sense of a sound's location in the virtual (imagined) space in which the sound seems to be occurring. This, too, is a form of localization.

In a concert recording or studio recording of music, or in a 3D computer game, the sound engineer may use digital audio processing techniques to create a particular localization effect.

Sound Localization

Factors Affecting Sound Localization

We gauge how far a sound source is from us mostly by how loud it is, relative to how loud we know it to be from prior experience. That is, we all have a large repertoire of sounds we have heard in different situations in our lives, and for which we have unconsciously noted the intensity relative to their distance from us. For example, we have all heard a basketball being bounced, both close-up (by ourselves) and far away (by someone else at the other end of the court). So, we can compare a current sound to our prior experiences to guess how far the sound is from us.

A given sound has an amount of power, which is defined as the amount of energy transferred from the source (in all directions) per unit of time (P = E/t). The intensity of a sound is its power measured over a certain area (I = P/a). Our tympanic membrane (a.k.a. our eardrum) and a microphone are both devices that measure sound intensity. When a sound arrives at our eardrum or at the diaphragm of a microphone, either of which has a certain surface area, the power in that area (i.e. the intensity) is detected.

However, the intensity of a sound, as measured by an eardrum or a microphone, will differ depending on the distance from the sound's source, because the sound is being emitted from the source in all directions. If you think of the sound energy as radiating outward from the source in a spherical pattern, and you bear in mind that the surface of a sphere is proportional to the square of its radius (the surface area of a sphere is equal to 4πr²), you can understand that the intensity of a sound as measured in a given surface area is inversely proportional to the square of the distance of the point of measurement from the sound source.

Our subjective sense of a sound's "loudness" is not the same as its intensity, but is generally roughly proportional to it. But what does that mean in terms of the amplitude factor we'll use to alter a sound's intensity in digital audio? As defined in physics, the intensity of a wave is proportional to the square of its amplitude (I ∝ A²). So that means that if we want to emulate the effect of a sound being twice as far away (1/4 the intensity), we would need to multiply the amplitude by one-half (because (1/2)² = 1/4).
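To make this concrete, here is a minimal Python sketch (the function and variable names are ours, purely for illustration): since intensity falls off as 1/r² and intensity is proportional to amplitude squared, amplitude falls off as 1/r.

```python
import numpy as np

def scale_for_distance(signal, ref_distance, new_distance):
    # Intensity falls off as 1/r^2 and I is proportional to A^2,
    # so amplitude falls off as 1/r.
    gain = ref_distance / new_distance
    return signal * gain

# Example: a 440 Hz tone "moved" from 1 m to 2 m gets half the
# amplitude (one quarter the intensity).
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
tone_far = scale_for_distance(tone, ref_distance=1.0, new_distance=2.0)
```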

The preceding discussion of the inverse square law assumes an idealized open space in which the sound is free to radiate in all directions from the source with no reflections. In reality, there is always at least the ground, and usually some walls and a ceiling, off of which the sound reflects, complicating the idealized spherical model somewhat. Some sound goes directly to the listener, and some strikes the floor, ceiling, and walls. When sound strikes a surface, some of its energy is absorbed by the surface (or transmitted through the surface) and some of it is reflected. The sound that's reflected (after being diminished somewhat by the absorption) may also reach the listener, as if it came from the surface itself (or a virtual source beyond the surface). These reflections (and reflections of reflections) travel a longer distance than the direct sound, so they're very slightly delayed relative to the direct sound, and they are so numerous and quick that they blend together, causing a slight amplification and prolongation of the sound, known as reverberation.

The balance between direct sound and reverberated sound is another cue that helps the listener have an idea of the proximity of the sound source. For example, when someone whispers into your ear, you hear almost exclusively direct sound, whereas when they talk to you from across the room you hear much more reflected sound in addition to the direct sound. So, in addition to intensity, the balance of direct ("dry") sound and reverberated ("wet") sound also give a sense of space.
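This cue can be sketched in code as well (a hypothetical illustration, not a production reverb): as a source recedes, the direct level drops roughly as 1/r, while the diffuse reverberant level in a given room stays roughly constant, shifting the balance toward "wet."

```python
import numpy as np

def distance_mix(dry, wet, distance, ref_distance=1.0):
    # Direct sound falls off roughly as 1/r; diffuse reverberation in
    # a room stays roughly constant, so distant sources sound "wetter."
    direct_gain = ref_distance / max(distance, ref_distance)
    return direct_gain * dry + wet

# The wet signal would normally come from a reverb effect applied to
# the dry signal; a quiet, delayed copy stands in for it here just to
# keep the sketch self-contained.
sr = 44100
dry = 0.5 * np.random.randn(sr)
wet = 0.1 * np.concatenate([np.zeros(2205), dry[:-2205]])
near = distance_mix(dry, wet, distance=1.0)  # mostly direct sound
far = distance_mix(dry, wet, distance=8.0)   # mostly reverberation
```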

In addition to sound absorption and reflection, another physical phenomenon that occurs when sound waves encounter a physical object is diffraction, which is the change of direction that occurs when sound passes through an opening or around an object. Any less-than-total barrier, be it a partial wall or any other object, leaves space for some sound to get past it, and the sound continues to radiate outward from that opening, effectively seeming to bend around the obstacle.

A sound's wavelength is inversely proportional to its frequency (λ ∝ 1/f), so lower-frequency waves have a greater wavelength than higher-frequency waves. For many absorptive surfaces, high frequencies tend to be absorbed more readily than low frequencies, so reflective surfaces can have a lowpass-filtering effect as well. Even the air itself absorbs sound slightly, and it tends to absorb higher frequencies more than lower ones. So, the greater the distance of a sound source from the listener, the greater the lowpass-filtering effect of air absorption.
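One crude way to emulate this in code is a one-pole lowpass filter whose cutoff falls with distance. The cutoff mapping below is an arbitrary illustration; real air absorption follows more detailed models (e.g., ISO 9613-1).

```python
import numpy as np

def air_absorption(signal, distance, sr=44100):
    # Cutoff drops as the source moves away; the 16 kHz / distance
    # mapping is purely illustrative, not a physical model.
    cutoff = min(sr / 2.0, max(500.0, 16000.0 / max(distance, 1.0)))
    a = np.exp(-2.0 * np.pi * cutoff / sr)  # one-pole feedback coefficient
    out = np.empty(len(signal))
    y = 0.0
    for i, x in enumerate(signal):
        y = (1.0 - a) * x + a * y  # y[n] = (1-a)*x[n] + a*y[n-1]
        out[i] = y
    return out
```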

Another important factor for sound localization, equally important as distance, is the angle of the sound source relative to the listener's orientation. When discussing angles in 3-dimensional space, one can refer to altitude (known as the elevation angle) and direction (known as the azimuth angle). We don't often hear sounds that come from radically different heights, whereas we often hear sounds coming from different directions. We're able to localize a sound's direction because we have two ears. If a sound is directly in front of us, it arrives at our two ears at exactly the same time and with equal intensity. However, if the sound is at all off of that central axis, it will arrive at our ears at slightly different times and with slightly different intensities. We're highly sensitive to these very subtle differences, known as the interaural time difference (ITD) and the interaural intensity difference (IID), and we use them to determine the azimuth angle of a sound source.
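A classical approximation of the ITD for a spherical head is Woodworth's formula, ITD = (r/c)(θ + sin θ), where r is the head radius, c the speed of sound, and θ the azimuth. A sketch, assuming a typical head radius of 8.75 cm:

```python
import numpy as np

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    # Woodworth's spherical-head approximation of the interaural
    # time difference, for azimuths between -90 and +90 degrees.
    theta = np.deg2rad(azimuth_deg)
    return (head_radius / c) * (theta + np.sin(theta))

print(itd_woodworth(0) * 1000, "ms")   # 0.00 ms: on-axis, no difference
print(itd_woodworth(45) * 1000, "ms")  # ~0.38 ms
print(itd_woodworth(90) * 1000, "ms")  # ~0.66 ms: the maximum ITD
```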

As depicted below, an off-axis sound takes longer to reach one ear than the other, and must diffract around the head, with a related slight loss of intensity and high-frequency content. The ITD is very slight, less than a millisecond, but it is enough to give us a sense of azimuth angle. A related phenomenon is the precedence effect, also known as the Haas effect: Helmut Haas demonstrated that the first-arriving sound dominates, so that ITD can overcome IID; we base our sense of a sound's direction on the first, direct version of the sound, even if a reflected version arrives from a different direction with somewhat greater amplitude than the direct sound.

Understanding Interaural Time Difference (ITD) and Interaural Level Difference (ILD)

Audio engineers will sometimes exploit the Haas effect by delaying the sound going to one of two stereo speakers, to give the sound a sense of directionality in the stereo field. Delaying one channel by several milliseconds (more than an ordinary ITD, but not so much that it is heard as a discrete echo) can give a sound a more "spacious" feel, perhaps because the listener becomes slightly confused about its virtual direction. An audio engineer might also create a difference in the intensity of the sound going to the two stereo channels, so as to give the listener an impression of significant IID.
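A minimal sketch of Haas-style delay panning (names and the 10 ms default are our own choices): the undelayed channel "wins," so the sound appears to come from that side.

```python
import numpy as np

def haas_pan(mono, sr=44100, delay_ms=10.0, delay_right=True):
    # Delay one channel by a few milliseconds; the listener localizes
    # the sound toward the earlier (undelayed) channel.
    d = int(sr * delay_ms / 1000.0)
    delayed = np.concatenate([np.zeros(d), mono])
    undelayed = np.concatenate([mono, np.zeros(d)])
    if delay_right:
        return np.stack([undelayed, delayed], axis=1)  # pulled left
    return np.stack([delayed, undelayed], axis=1)      # pulled right
```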

Stereophonic recording and playback, the use of two separate channels of sound so that the listener's two ears receive different signals, was invented in the 1930s and has been the norm in recorded music since the 1960s. The perceived azimuth angle of a sound in virtual space can be influenced by the balance of intensity between the two speakers. Examples in the Max Cookbook demonstrate techniques for intensity-based stereo panning; the most favored of those methods is constant-power panning (a.k.a. equal-power panning). Many other systems have been devised, using four, six, eight, or more speakers, to create more realistic 2D or 3D sound spatialization that goes beyond stereo.
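The Max Cookbook examples are written in Max, but the underlying math is small enough to sketch in Python. Constant-power panning keeps left² + right² constant, so perceived loudness does not dip as the sound moves across the stereo field:

```python
import numpy as np

def constant_power_pan(mono, pan):
    # pan in [-1, 1]: -1 = hard left, 0 = center, +1 = hard right.
    # Mapping pan onto a quarter sine/cosine cycle guarantees
    # left_gain**2 + right_gain**2 == 1 at every position.
    angle = (pan + 1.0) * np.pi / 4.0
    left = np.cos(angle) * mono
    right = np.sin(angle) * mono
    return np.stack([left, right], axis=1)
```

At the center position both gains are cos(45°) ≈ 0.707 rather than 0.5, which is precisely what keeps the total power constant across pan positions.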

Applications of Sound Source Localization

Imagine being inside Madison Square Garden at a game or concert. Everyone around you is yelling at the top of their lungs, except one person. Now suppose we want to find that one person. Sound source localization can help us isolate that person and determine where he or she is in the crowd. While this is a trivial example, many applications require sound source localization, such as hearing aids, robotics, navigation for ships and self-driving cars, and surveillance.


Techniques for Sound Source Localization

Previous work in sound source localization has concerned the design of microphone arrays and the use of digital signal processing techniques.

These techniques can be broken into four groups: time difference of arrival (TDOA) methods, beamforming methods, high-resolution processing methods, and methods that require a training phase.

Time Difference of Arrival (TDOA)

Time difference of arrival (TDOA) is a technique that uses two or more receivers to locate a signal source from the difference in arrival times at the receivers; in our case, the signal is a sound source. Popular techniques used to estimate TDOA are the Generalized Cross-Correlation (GCC) and its derivatives, such as Generalized Cross-Correlation with Phase Transform (GCC-PHAT) and the Cross-Power Spectrum Phase (CSP). However, these methods are defined for an environment without reverberation, so they do not help in localizing reverberated sound sources.
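GCC-PHAT itself is compact enough to sketch. The cross-power spectrum of the two microphone signals is normalized to unit magnitude (the "phase transform"), so only phase information remains, and the peak of the resulting cross-correlation gives the delay estimate. A minimal NumPy version, with names of our own choosing:

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    # Cross-correlate via the frequency domain, whitening the
    # cross-spectrum so only phase information is kept.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so zero lag sits at the center of the window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)          # estimated TDOA in seconds
```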

Beamforming

Beamforming, also known as spatial filtering, is a signal processing technique that combines the elements of a sensor array in such a way that signals from particular angles experience constructive interference while others experience destructive interference. Using a microphone array, beamforming helps isolate the source of the sound. The best-known beamforming approaches are the Minimum Variance Distortionless Response (MVDR) and Linearly Constrained Minimum Variance (LCMV) methods. However, when a microphone array is faced with multiple sound sources, the TDOA and beamforming approaches are not successful in finding the source; hence, the other two families of methods were created.
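MVDR and LCMV are data-adaptive and more involved, but the core idea can be illustrated with the simpler delay-and-sum beamformer (our own illustrative sketch, not the MVDR algorithm): delay each channel so that a wavefront from the look direction lines up across microphones, then average.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, look_direction, sr, c=343.0):
    # mic_signals:    (n_mics, n_samples) recordings
    # mic_positions:  (n_mics, 3) positions in meters
    # look_direction: unit vector pointing toward the assumed source
    #
    # Mics with a larger projection onto the look direction are closer
    # to a far-field source, hear the wavefront earlier, and therefore
    # need the largest delay to align with the other channels.
    proj = mic_positions @ look_direction
    delays = (proj - proj.min()) / c
    shifts = np.round(delays * sr).astype(int)
    n = mic_signals.shape[1]
    out = np.zeros(n)
    for sig, s in zip(mic_signals, shifts):
        out[s:] += sig[:n - s]
    return out / mic_signals.shape[0]
```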

High-Resolution Processing Methods

Next, the methods using high-resolution processing, known as subspace localization methods, rely on spectral estimation and perform better than the TDOA and beamforming approaches. Common examples of subspace localization methods are MUltiple SIgnal Classification (MUSIC), Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT), and root-MUSIC. Given the nature of reverberant environments, other variants such as Recursively Applied and Projected MUSIC (RAP-MUSIC) and Self-Consistent MUSIC are also options, but they are not widely implemented.
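To show what "subspace" means in practice, here is a narrowband MUSIC sketch for a uniform linear array (the geometry, frequency, and snapshot handling are all simplifying assumptions for illustration). The spatial covariance matrix is split into signal and noise subspaces; directions whose steering vectors are nearly orthogonal to the noise subspace produce sharp peaks in the pseudo-spectrum.

```python
import numpy as np

def music_doa(X, n_sources, mic_spacing, freq, c=343.0):
    # X: (n_mics, n_snapshots) complex narrowband snapshots at `freq`.
    n_mics = X.shape[0]
    R = X @ X.conj().T / X.shape[1]   # spatial covariance estimate
    w, V = np.linalg.eigh(R)          # eigenvalues in ascending order
    En = V[:, :n_mics - n_sources]    # noise subspace (smallest eigenvalues)
    angles_deg = np.linspace(-90, 90, 181)
    spectrum = np.empty(len(angles_deg))
    for i, theta in enumerate(np.deg2rad(angles_deg)):
        # Steering vector for a plane wave arriving at angle theta.
        delays = np.arange(n_mics) * mic_spacing * np.sin(theta) / c
        a = np.exp(-2j * np.pi * freq * delays)
        # Steering vectors orthogonal to the noise subspace give peaks.
        spectrum[i] = 1.0 / np.abs(a.conj() @ En @ En.conj().T @ a)
    return angles_deg, spectrum
```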

Training Phase Methods

Finally, the last approach is a reasonably recent advancement. A new method, based on the phase information of the MUSIC spectra, has been proposed in a journal paper for localizing closely spaced sources with a limited number of sensors. Because of its novelty, however, there is not much more to report, and more work needs to be conducted before its usefulness can be assessed.

Sound Localization in Hearing Aids

The primary goal of most hearing aid fittings is to provide audibility, optimize intelligibility, and maximize sound quality. A secondary goal, which often is overlooked and seldom assessed clinically, is aided localization. The ability of hearing aid users to detect where sounds are coming from is important for many reasons. Obvious examples involve safety (a car horn honking on a busy street) or identifying a sound based on its location. Knowing where a voice originates, or locating a new talker in a group, provides important visual cues, assists in effective communication, and may lead indirectly to improved speech understanding. Moreover, simply locating and pairing sounds with visual stimuli in the everyday environment makes for more relaxed listening and an enhanced overall enjoyment of the world.

For people who have hearing loss, the assumed benefit of using amplification is that by providing audibility, especially in the high frequencies, localization is improved. It is possible, however, that no improvement is noticed, or in the worst case, aided localization actually is worse than unaided performance. There are few data available regarding aided localization satisfaction, as most outcome measures do not directly address this topic.

Over the years, there have been numerous studies of localization in normal-hearing individuals. For left-versus-right discrimination, we know that spectral differences, interaural time differences (ITDs), and interaural level differences (ILDs) are important. Left/right discrimination deteriorates significantly when hearing aid users are fitted with different microphone modes on the two ears, or with adaptive directionality. Thus, synchronization of the left and right hearing aids is required to ensure good left/right discrimination.

However, for localization of elevation (above or below ear level) and for front/back differentiation, the determining factor is the monaural high frequency spectral cues that are shaped by the pinna. ILDs and ITDs are the same for frontal and rear sound sources, and thus cannot be used for front/back discrimination. Due to reflections of incoming sound in the pinna, high frequencies are shaped differently for sounds coming from the front and rear. These differences are used by the auditory processing centers to determine whether a sound source is located in front or behind.

Localization Cues

Because of the importance of both binaural processing and audibility for localization, it would be expected that well-fitted bilateral hearing aids result in similar localization ability between hearing-impaired and normal-hearing individuals. Unfortunately, even with bilateral fittings, localization problems remain, perhaps due to such factors as cochlear temporal distortions, poor high frequency audibility, long-standing hearing deprivation, direct-versus-amplified sounds, and/or hearing aid processing delays and features (e.g., compression, directional technology, noise reduction algorithms, etc.).

Another factor that can have a negative impact on aided localization, especially for elevation and front/back performance, is hearing aid microphone location. This is particularly an issue with behind-the-ear (BTE) hearing aids, where the microphone location essentially negates the normal pinna and concha collection and shaping properties. There is some improvement for traditional ITE custom hearing aids, as the microphone is now located in the concha. When deep-fitted CIC products (faceplate located at the opening of the ear canal) were introduced in the 1990s, a reported benefit of this hearing aid style was improved localization. Research with this product indeed revealed that both horizontal and vertical localization for hearing-impaired individuals (mild-to-moderate losses) was significantly better than with a conventional ITE instrument, and essentially equal to that of people with normal hearing.

As mentioned, one factor that is important for elevation and front/back localization is the spectral shaping provided by the pinna. One method, therefore, to potentially enhance aided localization for a BTE microphone placement would be to attempt to replicate this unique shaping through processing of the input signal. A new hearing aid technology to accomplish this shaping is TruEar by Siemens. In the human ear, due to the reflections and resonant characteristics of the outer ear, frequencies above 1.5 kHz are amplified for sounds from the front relative to other azimuths. As a result, a positive directivity index (DI) is obtained for these frequencies. With conventional BTEs, directionality at high frequencies is reduced compared to the unaided human ear, as sound is picked up effectively behind the ear and thus does not include the direction-specific shaping by the pinna. Hence, it is not surprising that front/back discrimination might be degraded with BTEs.

In order to restore aided localization with BTEs, TruEar mimics the frequency-specific directivity of the human outer ear. This is accomplished by adjusting the hearing instrument’s directional microphone system to match the directivity pattern of the pinna as closely as possible. As shown in Figure 2, hearing instruments equipped with TruEar provide a very similar directivity at high frequencies to the unaided ear.

Figure 2. TruEar Technology

In a clinical study, the benefits of TruEar were compared not only to omnidirectional processing but also to hearing aids using traditional full-directional technology. Aided localization performance was measured in the hearing-impaired listeners at the initial fitting and again 3 weeks post-fitting with each scheme, to check for adaptation effects. Aided localization was tested in a medium-sized anechoic chamber using an array of 20 loudspeakers separated by 18°. The localization test used stimuli with a variety of spectral characteristics: 0.4 kHz and 3 kHz octave-band-filtered pulsed pink noises, squawking birds (a high-frequency-weighted broadband noise), traffic noise (a low-frequency-weighted broadband noise), and speech. The test conditions thus varied the spectral emphasis for frontal versus rearward sounds.

The results of the study revealed that when unaided, the hearing-impaired listeners performed significantly more poorly than the control group of normal-hearing listeners. This was true across all stimuli, especially in the front/back dimension, where front/back confusions were prominent. When aided bilaterally in the omnidirectional mode, localization performance was similar to unaided performance. When aided with either type of TruEar, there was no significant difference (compared to omni or to each other) for left-versus-right discrimination. Full directional processing, however, significantly degraded L/R discrimination for high-frequency-weighted stimuli.

As discussed earlier, one of the primary design goals for TruEar was to assist with front/back discrimination. The results at 3 weeks post-fitting are shown in Figure 3. Observe that, for the three stimuli with sufficient high frequency information (3 kHz pink noise, squawking birds, and speech), front/back discrimination was significantly improved with TruEar. Full directional processing shows a trend to improve aided localization; however, these results were not significant. Similar results were obtained for the alternative implementation of TruEar, which mimics the open ear only above 2 kHz.

Figure 3. Front-Back Discrimination with TruEar

These findings suggest that a rather close approximation of the directivity of the unaided ear is required to achieve a significant improvement. In general, the subjects made significantly fewer front/back errors with TruEar than with omni or full directional processing. Not surprisingly, there was no significant difference across test conditions for stimuli that do not carry high frequency energy (0.4 kHz pink noise and traffic).

In addition to mean values of RMS errors, a distribution was calculated for the “best” fitting for localization. Figure 4 shows that TruEar processing was beneficial for most (more than 75%) of the 21 hearing-impaired listeners for stimuli with high frequency cues. About 20% of the participants, however, did not show improved front/back localization with TruEar.

Figure 4. Distribution of Localization Performance

Although some patients demonstrated a clear improvement in localization with TruEar directly after fitting, a significant group effect was observed only after 3 weeks. Apparently, most subjects require an adaptation period of at least 3 weeks before they show benefit from TruEar. It is currently unknown whether further improvement occurs beyond 3 weeks.

Future Trends in Sound Source Localization

Unlike humans, the machines that use these techniques are not as robust across environments; they cannot find the source because they assume the source to be either stationary or in a non-reverberant environment. SONAR and RADAR are extremely useful navigation systems because they transmit signals and find vessels in settings where reverberation is not high (underwater sound waves, in the case of SONAR), which keeps the procedure simple. However, if SONAR or RADAR were used in a glass room to find a vessel, the results would not be promising. These limitations need to be surpassed so that technology can accurately locate the origin of a sound.

With the recent advancement of voice assistants such as Google Assistant and Siri, there has been a great deal of development in the speech and language processing field. This rise has brought about new methods for solving the source localization problem.

In this decade, machine learning will help alleviate the problems of sound source localization in almost all environments. In particular, deep learning, a subset of machine learning, has yielded some exciting results in detecting and localizing sources with networks like SELDnet.