Infant Auditory Perception Development: A Focus on Spectro-Temporal Acoustic Cues

The auditory system encodes the phonetic features of a given language by processing fine spectro-temporal acoustic changes in the speech signal. Even with a relatively immature auditory system (Moore, 2002), infants have been shown to distinguish phonetic contrasts in a language-specific manner before the end of their first year of life (see Kuhl, 2004; Saffran et al., 2006). However, it remains unclear whether infants and adults rely on the exact same acoustic information when discriminating native phonetic contrasts.

To address this question, the current study compares reliance upon the spectro-temporal acoustic cues of speech in a phonetic feature discrimination task across infants at two ages (6 and 10 months) and adults. To explore infants' auditory processing of speech, the study uses a psychoacoustic approach that has been described extensively over the last decades and that models the stages of auditory processing in adult listeners (cf. Moore and Linthicum, 2007).

A key concept of this psychoacoustic approach is that the human auditory system decomposes any complex acoustic signal (including speech) into its fine spectral and fine temporal modulations. The decomposition of the spectral modulations is related to the sensitivity of inner hair cells along the basilar membrane of the cochlea to a specific audio frequency range. This frequency-selective processing, from high frequencies at the base of the cochlea to low frequencies at the apex, can be modeled as a bank of narrowband filters, each with a passband equal to one equivalent rectangular bandwidth (ERB; Glasberg and Moore, 1990; Moore, 2003).
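The filter-bank idea can be made concrete with the ERB formula of Glasberg and Moore (1990). The sketch below is illustrative only: the function names, the 80-8000 Hz analysis range, and the choice of 32 bands are assumptions made for this example, not parameters taken from the studies discussed here.

```python
import math

def erb_bandwidth_hz(fc_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter at fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def hz_to_erb_number(f_hz):
    """Map frequency in Hz onto the ERB-number scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37

def erb_spaced_centers(f_low_hz, f_high_hz, n_bands):
    """Center frequencies spaced evenly on the ERB-number scale."""
    low = hz_to_erb_number(f_low_hz)
    high = hz_to_erb_number(f_high_hz)
    step = (high - low) / (n_bands - 1)
    return [erb_number_to_hz(low + i * step) for i in range(n_bands)]

# Assumed analysis range and band count, chosen only for illustration.
centers = erb_spaced_centers(80.0, 8000.0, 32)
bandwidths = [erb_bandwidth_hz(fc) for fc in centers]
```

Because the bandwidth grows with center frequency, the filters are narrow at low frequencies and wide at high frequencies. With the assumed range, adjacent centers fall roughly one ERB apart.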

Figure: Basilar membrane of the cochlea.

Then, the auditory system is thought to decompose the temporal components of each extracted narrowband signal at two main time scales: relatively slow amplitude variations over time (amplitude modulation or AM, often referred to as the temporal envelope), and relatively fast oscillations over time (frequency modulation or FM, often referred to as the temporal fine structure). These models helped to develop speech analysis-synthesis tools, called vocoders, that selectively assess the specific roles of spectral and temporal components in speech perception.
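One common way to separate the temporal envelope from the temporal fine structure of a narrowband signal is through its analytic signal (Hilbert transform). The text does not state which method the studies used, so the sketch below should be read as an assumption-laden illustration: it builds a synthetic tone with a 100 Hz carrier and a 10 Hz amplitude modulation (both invented for the example) and recovers the two components with a direct, unoptimized DFT.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a direct DFT (O(N^2), adequate for short demos)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0   # keep DC and Nyquist unchanged
        elif k < (n + 1) // 2:
            gain = 2.0   # double positive frequencies
        else:
            gain = 0.0   # discard negative frequencies
        spectrum[k] *= gain
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

# Hypothetical narrowband signal: 100 Hz carrier, 10 Hz AM, 50% depth.
fs_hz, n_samples = 1000, 200
signal = [
    (1.0 + 0.5 * math.sin(2 * math.pi * 10 * t / fs_hz))
    * math.cos(2 * math.pi * 100 * t / fs_hz)
    for t in range(n_samples)
]

z = analytic_signal(signal)
envelope = [abs(v) for v in z]                          # AM: slow amplitude variations
fine_structure = [math.cos(cmath.phase(v)) for v in z]  # FM: fast unit-amplitude carrier
```

Multiplying `envelope` by `fine_structure` sample by sample reconstructs the original signal. A vocoder manipulates one component while keeping or replacing the other.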

The Role of AM and FM Cues in Speech Perception

In adults, a wealth of vocoder studies have shown that FM cues convey essential information related to voice pitch and play an important role in speech perception in quiet for lexical-tone languages, which use pitch at the syllable level (e.g., Zeng et al., 2005; Kong and Zeng, 2006). Sentence recognition has been found to be more difficult when only FM cues are preserved in the signal (Gilbert and Lorenzi, 2006; Lorenzi et al., 2006; Sheft et al., 2008; Hopkins et al., 2010), yet FM cues provide crucial information in noisy environments (e.g., Zeng et al., 2005; Hopkins et al., 2008; Hopkins and Moore, 2009; Ardoint and Lorenzi, 2010).

AM cues, by contrast, have been found to convey syllabic and phonetic information that allows word and sentence identification in quiet listening conditions (Rosen, 1992; Shannon et al., 1995; Smith et al., 2002; Zeng et al., 2005; Lorenzi et al., 2006; Sheft et al., 2008). This was initially demonstrated by Shannon et al. (1995), who used noise-excited vocoders to investigate the impact of spectro-temporal degradation on speech identification. In that study, the researchers applied a filter bank to the original sentences to decompose the signal into 1, 2, 3, or 4 frequency bands, from which the original AM and FM cues were extracted. The FM in each band was replaced by a noise carrier, and the AM cues were low-pass filtered at different cutoff frequencies (16, 50, 160, or 500 Hz). Sentence identification scores in quiet were almost perfect in the 4-band AM condition but decreased as the number of frequency bands was reduced. Moreover, sentence recognition scores were worse in the condition where AM cues were preserved only below 16 Hz.
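The effect of the envelope cutoff frequency can be illustrated with a toy example. The sketch below applies an idealized brick-wall low-pass filter in the DFT domain to a synthetic envelope. This is an assumption made for clarity, not the filtering used by Shannon et al. (1995), and the envelope itself (a 4 Hz and a 40 Hz modulation) is invented for the example.

```python
import cmath
import math

def lowpass_envelope(env, fs_hz, cutoff_hz):
    """Zero every DFT bin above cutoff_hz (idealized brick-wall low-pass)."""
    n = len(env)
    spectrum = [
        sum(env[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        bin_freq_hz = min(k, n - k) * fs_hz / n  # absolute frequency of bin k
        if bin_freq_hz > cutoff_hz:
            spectrum[k] = 0.0
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

# Hypothetical envelope: slow (4 Hz) plus fast (40 Hz) amplitude modulation.
fs_hz, n_samples = 200, 200
envelope = [
    1.0
    + 0.5 * math.sin(2 * math.pi * 4 * t / fs_hz)
    + 0.3 * math.sin(2 * math.pi * 40 * t / fs_hz)
    for t in range(n_samples)
]

slow_only = lowpass_envelope(envelope, fs_hz, 16.0)      # 40 Hz modulation removed
slow_and_fast = lowpass_envelope(envelope, fs_hz, 50.0)  # both modulations kept
```

With the 16 Hz cutoff only the syllable-rate modulation survives. Raising the cutoff to 50 Hz restores the faster fluctuations.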

While it has been repeatedly observed that adults are able to correctly identify speech in quiet with only the slowest AM cues (<8-16 Hz), the identification of individual phonetic features becomes more nuanced in terms of what acoustic cues are used. Using confusion matrices of phonemes, Shannon et al. (1995) showed that the reduction of faster AM cues (>16 Hz) significantly affected consonant identification, but not vowel identification. Moreover, for consonants, the identification of place of articulation remained challenging even in the 4-band AM condition.

More recently, Xu et al. (2005) conducted a systematic study to determine the importance of various spectral and temporal information in phoneme identification. English-speaking adults were asked to identify consonants and vowels that varied in voicing, place of articulation, manner of articulation, duration, first formant (F1) frequency and second formant (F2) frequency. Syllables were vocoded using different numbers of bands (ranging from 1 to 16) and different low-pass filters for AM extraction (ranging from 1 to 512 Hz). Their findings showed that the optimal low-pass cutoff frequency for consonant recognition was 16 Hz, whereas for vowel recognition it was 4 Hz. Regarding spectral information, consonant recognition performance reached a plateau at 8 bands, while for vowel recognition it was 12 bands.

These findings from adult studies show that AM cues are the most important cues for overall speech recognition in quiet (i.e., at the sentence recognition level), but that the identification of consonants and vowels requires different contributions of fast and slow AM cues and of FM cues. In other words, various spectro-temporal cues play distinct functional roles in phoneme identification.

Auditory Development in Infants: Understanding Speech Perception

Vocoder Studies in Infants

To tackle developmental issues, vocoders have also been used to investigate how young listeners, and especially infants, use acoustic cues when processing speech sounds. Although this field of research is still emerging, the first infant studies using vocoders suggest that AM and FM cues play a different role at early ages than in adulthood.

For vowels, only one study to date has assessed English-learning 6-month-olds' ability to detect a phonetic change in degraded speech. This study tested discrimination between /a/ and /i/ in vocoder conditions that reduced FM cues and the number of spectral bands for AM extraction. Infants were found to detect a vowel change when the original AM (160 Hz cutoff frequency) was presented within 32 bands, but not when it was presented within 16 bands (Warner-Czyz et al., 2014).

For consonants, on the other hand, a handful of studies have investigated phonetic discrimination in young infants. Two studies used looking-time recording procedures to familiarize or habituate French-learning infants to one specific vowel-consonant-vowel sequence processed in one vocoder condition. The findings reveal that 6-month-olds were able to distinguish /aba/ from /apa/ when the slowest (<16 Hz) AM cues were preserved in only 32 bands, but that they required a longer listening time to display this behavior than in a condition where the original (<ERB/2) AM cues were preserved (Cabrera et al., 2013, 2015a). These studies demonstrate that 6-month-old infants can effectively use slow (<16 Hz) AM cues for consonant voicing or place discrimination, but that faster AM cues may play an important role in early phonetic discrimination.

Results along this line were also found in younger infants in a more recent study by Cabrera and Werner (2017) using an observer-based psychophysical procedure, the method used in the present study. English-speaking adult and English-learning 3-month-old participants were presented with one of five consonant categories (voiceless, voiced, labial, coronal, velar). In a yes-no task, participants were presented with a series of background syllables that exemplified the category under examination (e.g., voiced syllables like /ba/, /da/, /ga/, randomly repeated). They were evaluated based on their ability to detect change trials, where a single randomly selected “target” syllable (e.g., voiceless syllables like /pa/, /ta/, or /ka/) was played, and to withhold responses during no-change trials, where a background syllable was presented.
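The yes-no structure described above lends itself to a simple scoring sketch. The code below computes a hit rate (responses on change trials) and a false-alarm rate (responses on no-change trials), plus the sensitivity index d' from signal detection theory. The d' summary, the clamping floor, and the trial data are assumptions for illustration; the text reports proportions of listeners who discriminated and does not state the scoring criterion used.

```python
from statistics import NormalDist

def score_yes_no(trials):
    """trials: list of (is_change_trial, responded_yes) pairs."""
    n_change = sum(1 for is_change, _ in trials if is_change)
    n_no_change = len(trials) - n_change
    hits = sum(1 for is_change, yes in trials if is_change and yes)
    false_alarms = sum(1 for is_change, yes in trials if not is_change and yes)
    return hits / n_change, false_alarms / n_no_change

def d_prime(hit_rate, fa_rate, floor=0.01):
    """z(hit rate) - z(false-alarm rate); rates are clamped away from 0 and 1."""
    def clamp(p):
        return min(max(p, floor), 1.0 - floor)
    z = NormalDist().inv_cdf
    return z(clamp(hit_rate)) - z(clamp(fa_rate))

# Hypothetical session: 10 change trials and 10 no-change trials.
trials = (
    [(True, True)] * 8      # hits
    + [(True, False)] * 2   # misses
    + [(False, True)] * 2   # false alarms
    + [(False, False)] * 8  # correct rejections
)
hit_rate, fa_rate = score_yes_no(trials)
sensitivity = d_prime(hit_rate, fa_rate)
```

Comparing the hit rate against the false-alarm rate guards against a listener, or an observer judging an infant's behavior, who simply responds on most trials regardless of the stimulus.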

Both infants and adults were tested on their ability to discriminate consonants under quiet or noisy conditions in two vocoder conditions: (1) Fast AM, in which the original AM (filtered < 256 Hz) was preserved in 32 bands and FM was replaced by a pure tone, and (2) Slow AM, in which only the slowest AM (filtered < 8 Hz) was preserved in 32 bands and FM was replaced by a pure tone. Adults were able to discriminate consonants in both vocoder conditions in quiet environments. However, in noisy environments, the percentage of adults who correctly discriminated the consonant changes decreased from 70% to 20% between the Fast and the Slow AM conditions. These results confirmed that the slowest AM cues are not sufficient for adults' consonant discrimination in noise.

Infants did not discriminate consonants equally in both vocoder conditions in quiet environments. The percentage of infants who discriminated decreased from 81% to 50% between the Fast and Slow AM conditions. In noisy environments, a similar pattern emerged, with the percentage of infants discriminating decreasing from 96% to 48% between the Fast and Slow AM conditions.

In summary, these first infant studies using vocoders suggest that 3- and 6-month-old infants may not rely on exactly the same spectro-temporal modulations as adults when processing phonemes. However, the age at which infants start to use, or weight, the acoustic cues found to be used by adults to process speech remains unknown.

The Current Study: Investigating the Development of Auditory Processing

The present study aims to investigate the development of the early auditory processing of speech to provide further insights into the acquisition of the phonological properties specific to one's native language. Interestingly, during the first year of life, infants show asynchronous perceptual attunement to the vowels and the consonants of their native language. Specifically, infants start becoming attuned to native language vowels around 4-6 months of age (Trehub, 1976; Kuhl et al., 1992; Polka and Werker, 1994), earlier than when they start becoming attuned to native language consonants around 8-10 months of age (Trehub, 1976; Werker and Tees, 1984; Best et al., 1988, 1995).

Furthermore, at the lexical level, differences in the processing of vowels and consonants are also found, showing a shift in infants' reliance from vowels to consonants between 6 and 8-11 months of age when detecting word forms (Bouchon et al., 2015; Poltrock and Nazzi, 2015; Nazzi et al., 2016; Nishibayashi and Nazzi, 2016). While no study to date has explored this issue directly, one study investigated the development of spectro-temporal cue weighting in a cross-linguistic study comparing French- versus Mandarin-learning infants.

Figure: Spectrogram of a speech signal, illustrating spectro-temporal cues.

Cabrera et al. (2015b) investigated whether native language exposure influences reliance upon AM and FM cues in a discrimination task measuring looking times for two syllables varying in lexical tone (that is, a change in pitch at the syllable level; such contrasts are phonological in tonal languages such as Mandarin Chinese, but not in French). At 6 months, French- and Mandarin-learning infants displayed the same pattern of response. Both groups detected a change in lexical tone in an intact condition (without acoustic degradation), suggesting that French-learning infants were not yet attuned to this speech contrast. Neither group detected the change when fine spectral and FM cues were degraded, showing that these acoustic cues are required for lexical-tone detection at 6 months. At 10 months, however, an influence of language background was observed: Mandarin-learning 10-month-olds showed the same pattern of response as 6-month-olds, whereas French-learning 10-month-olds did not detect the lexical-tone change in the intact condition, indicating perceptual reorganization for this speech contrast. Moreover, French-learning 10-month-olds were able to discriminate the lexical tones when fine spectral and FM cues were degraded.

Accordingly, the current study focuses on French-learning infants aged 6 and 10 months and compares their reliance upon FM and AM cues when detecting native vowel or consonant feature contrasts, to assess whether, with age, infants rely more on slow or on faster temporal cues when processing native phonemes.

The present study extends the findings of Cabrera and Werner (2017), using an observer-based psychophysical yes-no task to measure the proportions of listeners able to detect a phonetic change in two vocoder conditions. Three groups of participants were tested in the exact same experimental conditions and setup: 6-month-olds, who have started to attune to the vowels but not the consonants of their native language; 10-month-olds, who have started to attune to both vowels and consonants of their native language; and adults.

Eight conditions were designed by crossing four phonetic contrasts (Vowel Place, Vowel Height, Consonant Place, and Consonant Voicing) with two vocoder conditions: a Fast AM condition (preserving the original AM cues in 32 bands with an ERB/2 cutoff frequency, which retains both fast and slow AM, with reduced FM cues) and a Slow AM condition (preserving only the slowest AM cues, below 8 Hz, in 32 bands, with reduced FM cues). Each listener was exposed to only one phonetic feature contrast in its two vocoder conditions, starting with the Fast AM condition and moving to the Slow AM condition only after succeeding in the Fast AM condition.
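The design described above can be sketched in a few lines: four phonetic contrasts crossed with two vocoder conditions, with each listener advancing to the Slow AM condition only after passing the Fast AM condition. The function `run_session` and its `passes_condition` callback are hypothetical stand-ins for the actual testing procedure, which the text does not detail.

```python
from itertools import product

FEATURES = ["Vowel Place", "Vowel Height", "Consonant Place", "Consonant Voicing"]
VOCODERS = ["Fast AM", "Slow AM"]  # tested in this order

# Four contrasts x two vocoder conditions = eight conditions in total.
conditions = list(product(FEATURES, VOCODERS))

def run_session(feature, passes_condition):
    """Test one listener on one contrast; stop after the first failed condition.

    passes_condition(feature, vocoder) -> bool is a hypothetical callback
    standing in for the real discrimination test.
    """
    results = {}
    for vocoder in VOCODERS:
        passed = passes_condition(feature, vocoder)
        results[vocoder] = passed
        if not passed:
            break  # Slow AM is attempted only after success in Fast AM
    return results

# Example: a listener who succeeds with fast AM cues but not with slow ones.
example = run_session("Consonant Voicing", lambda feature, vocoder: vocoder == "Fast AM")
```

Under this sequential rule, a listener who fails the Fast AM condition contributes no data to the Slow AM condition, which matters when comparing success rates across the two.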

Based on prior behavioral studies, we expected a higher success rate among 6-month-olds in the Fast AM condition compared to the Slow AM condition, as these infants typically exhibit a stronger weighting of Fast AM cues. Nonetheless, as 6-month-olds have already started to attune to the vowels of their native language, we hypothesized that temporal degradation may have a more pronounced effect on consonant detection than on vowel detection. For 10-month-olds, we predicted similar performance for both the Fast AM and Slow AM conditions, as they have started to attune to both vowels and consonants of their native language. As such, we expected any difference between the effect of temporal degradation on vowels and on consonants to be less pronounced in 10-month-olds than in 6-month-olds.

Participants were recruited through the Babylab Participant Pool at the Integrative Neuroscience and Cognition Center. The data of 40 6-month-old infants (mean: 28.2 weeks, range: 25.9 weeks-31.8 weeks; 24 girls, 16 boys), 40 10-month-old infants (mean: 45.9 weeks, range: 42.4 weeks-49.6 weeks; 16 girls, 24 boys) and 20 adults (mean: 21 years; range: 18 to 29 years; 13 females, 7 males) were included in the analyses. All infants were born full term, had ...

Participant Demographics

| Group | Number of Participants | Mean Age | Age Range | Female | Male |
|---|---|---|---|---|---|
| 6-Month-Old Infants | 40 | 28.2 weeks | 25.9 weeks - 31.8 weeks | 24 | 16 |
| 10-Month-Old Infants | 40 | 45.9 weeks | 42.4 weeks - 49.6 weeks | 16 | 24 |
| Adults | 20 | 21 years | 18 - 29 years | 13 | 7 |