
Sound Localization Accuracy Limitations: A Brain-Inspired Neural Network Approach

The human brain possesses a remarkable ability to perform sound localization in diverse auditory environments, a complex process that integrates audio information from both ears. This binaural hearing enables the auditory system to determine the position and distance of sounds by utilizing specific auditory cues.

Modern hearing aids and cochlear implants often alter or destroy the binaural cues that are essential for accurate sound localization and spatial hearing, making it harder for users to navigate complex auditory environments. Furthermore, technologies such as smart-home devices, autonomous vehicles, and robotics rely on advanced sensing and processing of audio signals to produce an accurate understanding of their surroundings. Computational models inspired by the human auditory system can help to improve our understanding of neural auditory processing characteristics, and lead to the development of efficient, neuromorphic approaches that can be integrated into technologies requiring precise and reliable sound localization.

In this work, we propose a brain-inspired model of auditory processing for azimuthal sound localization, built on temporal sparse coding in a neural network based on the locally competitive algorithm (LCA). The LCA has advantages over conventional machine learning approaches to auditory modeling: it incorporates more aspects of a biological network of neurons, extracts features of input signals sparsely, and aligns more closely with the natural processing pathways observed in the human auditory system.

Sound Localization

Unlike other neural network-based models for sound localization, our model independently learns to leverage specific auditory cues in a manner similar to the human brain, while achieving excellent localization accuracy.

Auditory Cues and Processing

The binaural cues of interaural time difference (ITD) and interaural level difference (ILD) capture the difference in arrival time and intensity of a sound between the ears. Spectral filtering from the outer ear (pinna) provides additional frequency-based information that helps to resolve ambiguities in ITD/ILD cues and to determine elevation. These binaural and spectral cues are processed along the auditory pathway to perform both horizontal and vertical localization of sounds.
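Both cues can be estimated directly from a pair of ear signals. The sketch below (a toy Python example, with an assumed 44.1 kHz sample rate and a synthetic noise burst; it is not part of our model) estimates ITD from the cross-correlation peak lag between the ears and ILD from the ratio of RMS levels:

```python
import numpy as np

fs = 44100                                  # assumed sample rate
rng = np.random.default_rng(0)

def itd_ild(left, right, fs):
    """Estimate ITD (seconds) from the cross-correlation peak lag and
    ILD (dB) from the RMS level ratio between the two ear signals."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # samples; negative: right lags left
    itd = lag / fs
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    ild = 20.0 * np.log10(rms(left) / rms(right))
    return itd, ild

# Synthetic example: the right ear hears the same noise 0.5 ms later, 6 dB quieter.
n = int(0.05 * fs)
delay = int(0.0005 * fs)                    # 0.5 ms -> 22 samples at 44.1 kHz
left = rng.standard_normal(n)
right = 0.5 * np.roll(left, delay)
itd, ild = itd_ild(left, right, fs)
```

For broadband signals the cross-correlation peak is unambiguous; for narrowband (tonal) inputs the peak repeats every period, which is exactly the high-frequency ambiguity that spectral cues help resolve.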

In this work, we present the first use of the neural-inspired LCA as part of our model for performing sound localization on binaural audio and investigate the auditory cues utilized by the model in performing this localization task. This approach produces an efficient and effective model that parallels the behavior of the brain when performing sound localization, and has promising applications for use in hearing technologies.

Recent Studies and Models

Recent studies have investigated the use of machine learning models for understanding and modeling aspects of the brain, including sound localization within the auditory system. Prior works have explored the use of deep neural networks (DNNs) to perform brain-inspired sound localization using architectures such as feedforward networks, convolutional neural networks (CNNs), and autoencoders. While these models can reveal interesting parallels to neural functioning and leverage binaural information for performing localization tasks, they often lack the biological plausibility needed to effectively represent the important mechanisms and characteristics within the human brain.

Other classes of localization models take a more biologically focused approach by modeling specific structures and dynamics of the auditory pathway. For instance, spiking neural network models that mimic the medial superior olive and other auditory structures have been proposed to capture the temporal precision and spatial tuning seen in biological neurons. Probabilistic models offer robust localization by explicitly calculating binaural cues such as ITDs and ILDs, but they do not integrate these cues in a biologically plausible manner and disregard the importance of other auditory cues processed within the brain.

Prior studies have shown the effectiveness of sparse coding for representing cortical and auditory processing in the brain and have investigated sparse coding for spatial hearing with binaural audio signals. The LCA, in particular, has been shown to learn receptive fields that are similar to the receptive fields of neurons within the Inferior Colliculus (IC) and Auditory Cortex when provided with audio inputs, suggesting that it may be well-suited for modeling auditory neuron behavior in the brain. However, these models are limited in their similarity to human auditory processing because they require the entire audio sample to be available at once, processing it as if it were a static image rather than a temporal signal. In contrast, our model processes audio in sequential time-steps, mimicking the real-time processing of the human auditory system. This approach allows for a more accurate representation of how the brain processes and localizes sound, improving the model’s biological plausibility and applicability to real-world auditory tasks.

Sound Localization Techniques and Technologies

Auditory Model Architecture

Cochlear Front-End

The first stage of our auditory model consists of a front-end based on the auditory periphery. The binaural audio is converted into a spectrogram using a Short-Time Fourier Transform (STFT) in MATLAB. Each frame of the spectrogram is set to 16 ms with an 8 ms overlap to balance temporal resolution with computational efficiency while preserving the timing dynamics necessary for sound localization, resulting in 30 time-steps. We filter the magnitude spectrograms into 64 channels using a log-scaled Mel filterbank with center frequencies ranging from 50 Hz to 8 kHz. This configuration is chosen to mimic the frequency selectivity that is characteristic of the human cochlea.
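A minimal version of this front-end can be sketched in Python with NumPy/SciPy (in place of MATLAB). The sample rate is not stated in the text, so 16 kHz is assumed here; the frame and hop lengths follow the 16 ms / 8 ms figures above:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                  # assumed sample rate
frame = int(0.016 * fs)                     # 16 ms frames -> 256 samples
hop = int(0.008 * fs)                       # 8 ms hop (8 ms overlap)

def mel_filterbank(n_mels, n_fft, fs, fmin=50.0, fmax=8000.0):
    """Triangular filterbank with Mel-scaled (log-like) center frequencies."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)   # STFT bin frequencies
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

audio = np.random.randn(fs // 4)            # placeholder 250 ms mono signal
_, _, Z = stft(audio, fs=fs, nperseg=frame, noverlap=frame - hop)
mel_spec = mel_filterbank(64, frame, fs) @ np.abs(Z)   # 64 channels x time-steps
```

The same filtering is applied independently to the left and right ear signals.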

To further improve localization accuracy, the phase information for frequencies below 1.5 kHz is retained from the STFT of both the right and left ear audio signals. The difference between the phases of the two ear signals is calculated, which allows the model to utilize timing information as a cue for azimuth determination. The phase information is limited to frequencies below 1.5 kHz both to limit the input size to the LCA network and to reflect the human auditory system’s primary dependence on interaural phase differences below this threshold, as phase cues become ambiguous at higher frequencies. The magnitude spectrograms from the right and left ears are concatenated with the phase difference, resulting in a total input size of 177 values over the 30 time-steps. The cochlear stage thus extracts initial features from the audio inputs, providing the time, frequency, and magnitude information needed by subsequent stages of the auditory model.
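The feature assembly can be sketched as follows. Because the sample rate is not given, 16 kHz is assumed here, so the resulting feature count (128 magnitude channels plus 24 low-frequency phase bins = 152) differs from the 177 values reported above; a random matrix stands in for the Mel filterbank:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                   # assumed sample rate
frame, hop = int(0.016 * fs), int(0.008 * fs)
mel_fb = np.random.rand(64, frame // 2 + 1)  # stand-in for the 64-channel Mel filterbank

def frontend_features(left, right, phase_cutoff=1500.0):
    """Concatenate left/right Mel magnitudes with interaural phase
    differences for STFT bins below the cutoff frequency."""
    f, _, L = stft(left, fs=fs, nperseg=frame, noverlap=frame - hop)
    _, _, R = stft(right, fs=fs, nperseg=frame, noverlap=frame - hop)
    mags = np.vstack([mel_fb @ np.abs(L), mel_fb @ np.abs(R)])   # 2 x 64 channels
    low = f < phase_cutoff                                       # bins below 1.5 kHz
    ipd = np.angle(L[low] * np.conj(R[low]))                     # wrapped phase difference
    return np.vstack([mags, ipd])                                # features x time-steps

left = np.random.randn(fs // 4)
right = 0.7 * np.roll(left, 8)               # toy binaural pair: delayed, attenuated
feats = frontend_features(left, right)
```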

Processing of Binaural Audio Inputs

Our model splits the computations relating to ITD and ILD that typically occur in the brainstem between the cochlear front-end and the auditory midbrain stages. In addition to providing the frequency filtering of the cochlea, the front-end of the model also synthesizes ITD information by taking the phase difference between the right and left ear signals. Magnitude information is passed to the midbrain stage, where ILD is processed.

Sparse Coding Auditory Midbrain via the Locally Competitive Algorithm

Following initial processing of the binaural audio by the cochlear model, we next perform sparse coding of the audio in our auditory midbrain stage using a form of the LCA based on biologically plausible spiking leaky integrate-and-fire (LIF) neurons. This sparse coding allows for efficient representation of the input signal in an overcomplete dictionary of neurons, in a similar approach to how the brain encodes sensory information in an efficient manner. The biologically inspired dynamics of the LCA make it an attractive candidate for modeling sparse coding in the brain.

Each neuron within the LCA behaves as a leaky integrate-and-fire neuron: inputs charge the neuron's potential until it reaches a firing threshold. Competition in the form of lateral inhibition allows strongly driven neurons to suppress weaker ones from reaching the firing threshold, producing a sparse output representation. In this work, we use a spiking version of the LCA, in which neurons communicate through discrete spikes instead of continuous signals. The final output of the spiking LCA network is the spike rate of each neuron averaged over the run-time of the network.

In prior work using the LCA for sparse coding of natural images and audio spectrograms, the inputs are typically flattened: the multidimensional data is converted into a one-dimensional format before being fed to the input layer of LCA neurons. This approach treats the input as a single time-step, with no subsequent inputs during network activity. Because the flattened input is large, the network must be correspondingly large, leading to prohibitively long dictionary training times and significant computational overhead. Additionally, flattening the spectrogram reduces the resemblance to auditory processing in the brain, which does not have access to the entire audio sequence at each stage of processing.

In this work, we provide sequential time-steps of the spectrogram as input to the LCA, allowing it to process each time-step individually before moving on to the next, rather than providing the entire flattened spectrogram in a single step. This approach mimics the temporal processing of auditory information in the brain, enabling each time-step to be evaluated in sequence. Neuron potentials are not reset between spectrogram slices to maintain the continuity of neural states between consecutive time steps. We use a network size of 200 neurons to form an overcomplete dictionary for our input of size 177. Providing the spectrogram to the LCA in this way allows us to maintain the resolution of the spectrogram while limiting the size of the LCA network, and ensures better biological plausibility.
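A toy version of this sequential spiking LCA is sketched below. The input size (177) and dictionary size (200) follow the text; the threshold, leak rate, and iteration count are illustrative assumptions, and the dictionary is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_neurons = 177, 200                    # input size and overcomplete dictionary
D = rng.standard_normal((n_in, n_neurons))
D /= np.linalg.norm(D, axis=0)                # unit-norm dictionary atoms
G = D.T @ D - np.eye(n_neurons)               # lateral inhibition (feature similarity)

def lca_step(x, u, thresh=0.5, leak=0.1, n_iter=50):
    """Run spiking LCA dynamics for one spectrogram time-step.
    u holds membrane potentials carried over from the previous frame."""
    b = D.T @ x                               # feedforward drive
    spikes = np.zeros(n_neurons)
    for _ in range(n_iter):
        fired = (u >= thresh).astype(float)   # LIF threshold crossings
        spikes += fired
        u = u - fired * thresh                # reset-by-subtraction on spiking
        u = u + leak * (b - u - G @ fired)    # leaky integration + inhibition
    return u, spikes

X = rng.standard_normal((n_in, 30))           # placeholder spectrogram, 30 time-steps
u = np.zeros(n_neurons)                       # potentials are NOT reset between frames
rates = np.zeros(n_neurons)
for t in range(X.shape[1]):
    u, s = lca_step(X[:, t], u)
    rates += s
rates /= 30 * 50                              # spike rate averaged over the full run
```

Carrying `u` across the loop iterations is the key difference from flattened-input LCA: each frame's dynamics start from the state left by the previous frame, as in a continuously running biological network.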

Cortical Processing by a Machine Learning Classifier

After the LCA stage extracts relevant features from the auditory inputs through sparse coding, a feedforward classifier mimics higher-level cortical processing to determine sound location. This stage of our model is a feedforward neural network with a single hidden layer of 32 neurons, chosen to balance simplicity and computational efficiency. Before input to the classifier, the activity of each LCA neuron is averaged over the duration of the spectrogram to produce its relative activity during the auditory scene. The classifier outputs the most probable location of the sound’s origin, using a softmax layer to handle classification among the potential sound locations.
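This cortical stage reduces to a small feedforward network over the time-averaged LCA activity. A sketch with random (untrained) weights follows; the 32-unit hidden layer matches the text, while the 19 azimuth classes are purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_classes = 200, 32, 19       # 19 azimuth classes is an assumption

W1 = 0.1 * rng.standard_normal((n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_classes))
b2 = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())                   # shift for numerical stability
    return e / e.sum()

def classify(rates):
    """Map time-averaged LCA spike rates to a distribution over azimuths."""
    h = np.maximum(0.0, rates @ W1 + b1)      # single 32-unit hidden layer (ReLU)
    return softmax(h @ W2 + b2)

rates = rng.random(n_in)                      # placeholder time-averaged activity
p = classify(rates)
predicted_azimuth = int(np.argmax(p))         # most probable source location
```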

Localization Architecture Overview

Fig. 2 An overview of the localization architecture.