
Auditory Scene Analysis with Deep Learning: A Comprehensive Tutorial

Auditory scene analysis (ASA) aims to provide a description of the sound sources in the acoustic environment [1] by carrying out three main tasks: sound source localization, separation, and classification. That is, it estimates each source's location, extracts its separated audio stream, and determines its type (human speech, urban sound, noise, etc.). These tasks are traditionally executed in a linear data flow: the sound sources are first localized; then, using their locations, each source is separated into its own audio stream; finally, information relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.) is extracted from each stream.
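As a rough sketch, this linear data flow can be expressed as a chain of three functions, each consuming the previous task's output. The `localize`, `separate`, and `classify` stubs below are toy placeholders, not the actual techniques; a "mixture" here is just a list of (DOA, sample) pairs standing in for real audio:

```python
# Toy illustration of the traditional linear ASA data flow.

def localize(mixture):
    """Task 1: find the DOAs present in the scene (toy stand-in)."""
    return sorted({doa for doa, _ in mixture})

def separate(mixture, doa):
    """Task 2: extract the stream belonging to one DOA (toy stand-in)."""
    return [sample for d, sample in mixture if d == doa]

def classify(stream):
    """Task 3: label the stream (toy stand-in)."""
    return "speech" if sum(stream) > 0 else "noise"

def traditional_asa(mixture):
    # Linear flow: localize -> separate -> classify.
    # An error in localize() propagates to every later stage.
    doas = localize(mixture)
    streams = [separate(mixture, d) for d in doas]
    return [(d, classify(s)) for d, s in zip(doas, streams)]
```

If `localize()` returns a wrong DOA, `separate()` extracts the wrong stream and `classify()` labels the wrong source, which is exactly the error-propagation problem discussed next.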

However, running these tasks linearly increases the overall response time, while making the later tasks (separation and classification) highly sensitive to errors in the first task (localization). Considerable effort and computational complexity have been invested in the state of the art to develop techniques that are as error-free as possible. However, doing so gives rise to an ASA system that is non-viable in applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, and human-robot interaction.

To this end, this work proposes a multi-agent approach to ASA in which the tasks run in parallel, with feedback loops between them to compensate for local errors: for example, using the quality of the separation output to correct the location error, and using the classification result to reduce the localization's sensitivity to interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time.
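A minimal sketch of this idea is two toy agents running in parallel and exchanging feedback through queues. The 0.5 correction factor and the 15-degree initial error below are arbitrary illustrative values, not the actual MASA mechanics:

```python
import queue
import threading

def localization_agent(doa_out, corr_in, history, steps=3):
    """Toy localization agent: publishes a (deliberately wrong) DOA
    estimate and refines it with feedback from the separation agent."""
    doa = 15.0  # initial estimate, 15 degrees off the true source at 0
    for _ in range(steps):
        history.append(doa)
        doa_out.put(doa)
        doa += corr_in.get()  # blocks until the peer agent answers

def separation_agent(doa_in, corr_out, steps=3):
    """Toy separation agent: pretends separation quality degrades with
    DOA error and feeds a corrective nudge back to the localizer."""
    for _ in range(steps):
        doa = doa_in.get()
        corr_out.put(-0.5 * doa)  # quality-driven correction toward 0

doa_q, corr_q = queue.Queue(), queue.Queue()
history = []
agents = [
    threading.Thread(target=localization_agent, args=(doa_q, corr_q, history)),
    threading.Thread(target=separation_agent, args=(doa_q, corr_q)),
]
for a in agents:
    a.start()
for a in agents:
    a.join()
# history shrinks toward the true DOA: [15.0, 7.5, 3.75]
```

Each agent runs in its own thread and only communicates through messages, so neither blocks the overall system while it works, which is the property the multi-agent structure relies on.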


Principles of Auditory Scene Analysis

Limitations of Traditional ASA Systems

There are several limitations that come with this type of data flow. First, if an error occurs at the beginning of the data flow, it is passed on to the following tasks: a localization error begets separation degradation, which in turn begets a wrong classification. To tackle this issue, local errors (those that occur at the task level) are usually minimized by making each task's technique robust against the local errors of prior tasks. This, however, adds complexity to the techniques, which in turn increases their computational requirements.

Additionally, this added complexity may increase each technique's response time, which is further aggravated by the fact that, in linear data flows, the overall response time is the sum of the response times of all the tasks. An increase in this overall response time may limit the system's viability in real-time applications [17, 18].

Accordingly, this work proposes a new paradigm to carry out ASA, structured as a multi-agent system (MAS) [19]. This type of system aims to model very complex behaviors as a set of small computing entities (agents) that run in parallel, each solving a smaller, simpler task. The key to a MAS is that these agents are expected to interact with one another [20], employing a non-linear information data flow. It is from this inter-agent interaction that the complex behavior is expected to emerge [19].

Using a MAS to solve a complex task has several benefits: efficient operation due to its parallelization potential; reliability and robustness; the flexibility to add or remove agents to fit the needs of the application scenario; and a lower computational cost compared to a centralized approach.

Multi-Agent Auditory Scene Analysis (MASA)

Recently, a first effort toward this multi-agent paradigm was carried out [22], in which a feedback loop was implemented between the separation and localization tasks to correct location errors in real time based on speech quality. It showed strong robustness against location errors of up to 20°, which is strong evidence of the viability of the proposed multi-agent approach.

In the work presented here, the effort of [22] is expanded upon and formalized into a complete framework, with several local improvements to its currently implemented agents and additional inter-agent feedback loops. The system diagram below shows all the agents that are currently a part of MASA, as well as how they are inter-connected.

MASA System Diagram

Multi-agent auditory scene analysis (MASA) system

Sound Source Localization Agent

This agent is based on the work of [25], where a lightweight technique was proposed to estimate the direction of arrival (DOA) of multiple mobile speech sources using only three microphones. It was shown to be able to estimate the DOA of up to 4 simultaneous speech sources, and was able to track several mobile sound sources. It assumes that simultaneous sources do not fully overlap, such that several DOA “candidates” are gathered over time.

Building on [25], a novel multi-speaker localization and tracking technique was developed using a triangular microphone array [26], which is described here for completeness' sake. GCC-PHAT enhances robustness against reverberation, while frequency filtering reduces noise-induced false positives. This also enables multi-speaker detection via clustering of arrival directions, since only coherent DOAs are provided to the tracking side of the localization technique.

Speaker movement is modeled as circular motion, with variable velocity and normally distributed acceleration. This provides better adaptability than constant-velocity models, while being more efficient than complex non-linear alternatives.
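A minimal sketch of this kind of tracker, here reduced to a 2-state (angle, angular velocity) Kalman filter driven by random-acceleration process noise. The exact motion model of [26] is more elaborate, and all parameter values below are illustrative:

```python
class DOATracker:
    """Minimal Kalman filter over [angle (deg), angular velocity (deg/s)]
    with normally distributed acceleration as process noise."""

    def __init__(self, dt=0.1, q_acc=1.0, r_meas=4.0):
        self.dt = dt
        self.x = [0.0, 0.0]                      # state estimate
        self.P = [[100.0, 0.0], [0.0, 100.0]]    # state covariance
        self.q = q_acc                           # acceleration noise power
        self.r = r_meas                          # DOA measurement variance

    def step(self, z):
        dt, x, P = self.dt, self.x, self.P
        # predict with F = [[1, dt], [0, 1]]
        xp = [x[0] + dt * x[1], x[1]]
        q11 = self.q * dt ** 4 / 4
        q12 = self.q * dt ** 3 / 2
        q22 = self.q * dt ** 2
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q11
        p01 = P[0][1] + dt * P[1][1] + q12
        p10 = P[1][0] + dt * P[1][1] + q12
        p11 = P[1][1] + q22
        # update with the angle measurement z (H = [1, 0])
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        y = z - xp[0]
        self.x = [xp[0] + k0 * y, xp[1] + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]
```

Because acceleration enters only as noise, the filter adapts to changing speaker velocity without the cost of a non-linear motion model.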

This localization system was compared with another popular lightweight sound source localization system, which is part of the ODAS framework [30]. ODAS employs beamforming for source localization instead of TDOA estimation, as well as particle filtering for source tracking instead of Kalman filtering. Performance was measured in terms of the number of speakers detected and DOA accuracy. The computational efficiency of both systems was also recorded, measured as the CPU usage percentage of one Intel Core i5-7200U core.

The results showed that in reverberant/noisy scenarios, the performance of the proposed system was comparable to that of ODAS using 8 microphones. Given that ODAS was later improved in [32], this performance gap has potentially narrowed even further. However, ODAS with 3 microphones performed worse in detection and false positive rate when tested under identical conditions. Additionally, there was a ~50% reduction in computational requirements, since [26] only used ~35% of the CPU with 3 microphones, while [32] used ~70%.

Speech Enhancement Agent

Real-time speech enhancement has improved recently, in great part because of lightweight models such as Demucs [33], which has an architecture based on the U-Net paradigm [34], as shown in Figure 4. However, it has been shown that its performance drops substantially in multi-speaker scenarios [35]. This is understandable, given the "one speech source to enhance" assumption that most (if not all) speech enhancers are trained with. However, a phase-based frequency-masking beamformer [36] can be used to "nudge" the Demucs model towards the speech source of interest.

The beamforming output increases the energy of the steered source (although it does not separate it completely from the rest of the noisy mixture). In conjunction, it has been shown that the Demucs model tends to separate the highest-energy speech source in multi-speaker scenarios [35]. Thus, a real-time location-based target selection strategy has been previously proposed [37] which steers the Demucs model (here referred to as demucs) to enhance a speech source located at a given DOA.

Unfortunately, this strategy has been shown to be sensitive to location errors [37]. This issue prompted a real-time correction system of the DOA by maximizing the quality of the enhanced speech [22], which is the predecessor of the multi-agent approach here proposed.

In the work presented here, the demucs speech enhancement model is modified to take advantage of a virtue of the aforementioned phase-based frequency-masking beamformer [36]: it is able to produce a preliminary estimation not only of the steered sound source, but also of the cumulative environmental interference [36]. This two-output version of the beamformer is herein referred to as beamformphasemix, while the original one-output version is referred to as beamformphase.

As can be seen, demucsmix outperforms demucs, but at the cost of higher memory usage. It may seem unnecessary for the model to carry out the noise decoding stage that estimates the cumulative environmental interference (N̂ in Figure 7), since it surely contributes to the increase in memory usage. However, other architectures that did not do this were also tested, and their performance did not improve upon the original demucs model. Once the model was also given the task of estimating the interference, the performance showed the improvements presented in Table 1. A possible explanation for this phenomenon is that the sequence modeling stage of the model needs to identify which part of the input signal is the target speech and which is not.

Demucs Architecture

Speech Quality Assessment

In [22], the speech quality is measured using the Squim model [40], which does not require a reference recording to provide a quality estimation (i.e., a non-intrusive quality estimation). The Demucs model was originally run with an input window length (t_i) of 0.064 s. However, the Squim model provides very inconsistent results, with a high amount of variance, even within a single static recording. Thus, the impact of its capture window length (t_w) on this variance was also evaluated.

In all of the following evaluations, a recording of the AIRA corpus [31] was used, where the target source is positioned near 0° and an interference is positioned near 90°. The speech sources were positioned around an equilateral-triangular microphone array, with each microphone pair spaced 0.18 m apart. The recording has a length of 30 s, but it was repeated 4 times to provide ample time to evaluate the consistency of the optimization process. This recording was sampled at 48 kHz, and was fed to the beamformer with a 1024-sample window. It was then re-sampled at 16 kHz in real time (since the Demucs and Squim models were trained at this sample rate), and fed to the rest of the modules with the window values explored in this section.

Using the reference recording provided along with the previously described recording in the AIRA corpus, the SDR was calculated from the output of the Demucs model using different values of t_i. The estimated DOA (θ_est) was set at 15°, a non-optimal location, so that the quality ceiling is not reached. As can be seen, a t_i value of 0.512 s provides the best SDR results. This was confirmed in subjective listening sessions, where shorter t_i values resulted in more discontinuities between t_i windows, while longer t_i values resulted in the Demucs model not responding fast enough given the larger amount of data it was fed.
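For reference, a simple form of the SDR metric used in such comparisons is the ratio (in dB) of reference energy to residual energy. This is the plain definition; BSS-eval-style variants additionally project the estimate onto the reference before measuring distortion:

```python
import math

def sdr(reference, estimate):
    """Plain signal-to-distortion ratio, in dB, between a reference
    signal and an estimate of it (sample-aligned lists of floats)."""
    num = sum(s * s for s in reference)                       # signal energy
    den = sum((s - e) ** 2 for s, e in zip(reference, estimate))  # error energy
    return 10 * math.log10(num / (den + 1e-12))               # guard div-by-0
```

Higher is better: a perfect estimate drives the error energy toward zero and the SDR toward infinity.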

Using the previously described recording, θ_est was set at 0° (to measure the highest possible SDR), and the standard deviation of the SDR estimated by the Squim model was calculated using different values of t_w. The value of t_h was set at 0.5 s to provide a stable comparison. As can be seen, a t_w value of 3.0 s provides the least amount of SDR variation.

The work in [22] presents a DOA correction scheme that maximizes the speech quality as assessed by the onlinesqa agent. The Adam-based optimization process establishes the speed of its adaptation with the learning rate parameter (η), the value of which was extensively explored originally. However, its value is highly dependent on the amount of time it takes to receive a new quality metric (t_h).

There is only one possible solution to the optimization task (the highest quality is at the correct DOA), and the solution space is very close to convex in the vicinity of the correct DOA. Moreover, the soundloc agent continuously provides its own estimated DOA (θ_est), not just at the beginning of the optimization process.
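A minimal sketch of such a correction loop: Adam ascent with finite-difference gradients on a hypothetical stand-in quality surface (in the real system the quality comes from the Squim model, not a closed-form function). The 1/√t step decay is an added stabilizer for this sketch, not part of [22]:

```python
import math

def quality(theta, theta_true=0.0):
    """Hypothetical stand-in for the onlinesqa quality estimate:
    a concave surface peaked at the true DOA."""
    return -(theta - theta_true) ** 2

def correct_doa(theta_est, lr=2.0, steps=300, eps=1e-3):
    """Adam ascent on the quality surface, starting from the (possibly
    wrong) DOA estimate and climbing toward the quality maximum."""
    theta, m, v = theta_est, 0.0, 0.0
    b1, b2 = 0.9, 0.999
    for t in range(1, steps + 1):
        # finite-difference gradient of the quality w.r.t. the DOA
        g = (quality(theta + eps) - quality(theta - eps)) / (2 * eps)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat = m / (1 - b1 ** t)
        vhat = v / (1 - b2 ** t)
        # 1/sqrt(t) decay lets the correction settle near the optimum
        theta += (lr / math.sqrt(t)) * mhat / (math.sqrt(vhat) + 1e-8)
    return theta
```

Since the surface is close to convex near the correct DOA (as noted above), this kind of first-order ascent is enough; no global search is needed.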

To bound the possible combinations of t_h and η values, an evaluation similar to the one in Section 2.3 was first carried out to find the t_h values that provide the least amount of SDR variance. As can be seen, t_h values between 1.0 s and 2.0 s provide the least amount of SDR variance.

With this information, another evaluation was carried out to find the best combination of t_h and η values. To do this, for every combination, 10 runs were carried out (obtaining the value of θ_corr that maximizes Q), with varying amounts of location errors (θ_est). A ‘good run’ i...


Table 1: Performance Comparison of Demucs Models

Model       Performance           Memory Usage
demucs      Baseline              Lower
demucsmix   Outperforms demucs    Higher