Understanding Target Speech Hearing in Listeners with Sensorineural Hearing Loss
There is considerable evidence that sensorineural hearing loss (SNHL) adversely affects a listener's ability to hear and understand the speech of one specific talker in the presence of competing talkers. This evidence is varied, comprising subjective reports of communication difficulty as well as findings from laboratory-based studies of speech-on-speech (SOS) masking. Despite this accumulation of evidence, the causes of the difficulties experienced by listeners with SNHL in multiple-talker sound fields remain poorly understood.
The limitations of our understanding of this problem are due in part to the complexity of solving the SOS masking task in general, which places demands on processing at multiple sites in the system spanning the auditory pathway from sensory transduction in the periphery to linguistic processing in the brain. This raises the possibility that adverse effects of SNHL on performance could occur at multiple stages of processing of the stimulus or at different stages for different individuals with otherwise similar audiometric profiles. Furthermore, the outcome of the SOS masking experiment itself is quite sensitive to the degree of uncertainty that is present in the listener's task.
Listener uncertainty can be a difficult variable to control or to quantify, and the particulars of the experimental design, which determine the uncertainty inherent to the task, may exert a profound influence on the outcome of the experiment. In this article we examine these issues by testing a group of listeners with SNHL under conditions that exactly match those tested in an earlier study of listeners with normal hearing (NH).
Energetic Masking vs. Informational Masking
The focus of this study is on one aspect of this complicated problem: better understanding the interaction between SNHL and the type of masking that is present in the listening environment. By “type of masking” we refer to the distinction that often is made between masking that originates primarily in the auditory periphery, termed energetic masking (EM), and masking that originates primarily at higher levels of the auditory system, termed informational masking (IM). EM occurs due to the overlap of excitation in peripheral neural structures and limits the information available for processing at subsequent levels in the system. IM, in contrast, occurs despite a neural representation of the target stimulus from the periphery that is sufficient for the observer to solve the task.
Because the primary cause of IM is listener uncertainty, and therefore reflects the internal state of the observer, it can be difficult to isolate and control as a variable in experimental design and performance may depend on factors such as stimulus predictability and the a priori knowledge available to the observer. Furthermore, because the processing that occurs subsequent to peripheral transduction depends on the fidelity of the representations that are received at each successive stage, factors that affect EM may manifest as differences in performance on tasks intended to measure aspects of auditory perception and cognitive processing.
For listeners with SNHL, making effective use of incomplete or distorted peripheral representations may depend on rapidly “filling in” the missing information at some stage of processing, and presumably draws heavily on the predictability provided by context. When attempting to determine the origins of the difficulties encountered by listeners with SNHL in complex listening situations, such as the multiple-talker sound fields created in the laboratory, controlling the type of masking (EM vs. IM) may be crucial to understanding the factors governing performance.
Among the more effective tools for separating EM from IM in SOS masking is “ideal time-frequency segregation” (ITFS), a technique originally developed for studying computational auditory scene analysis that was later applied to the SOS masking problem by Brungart and colleagues. ITFS processing usually assumes a priori knowledge of the target and masker signals so that the energy relations in each time-frequency (T-F) unit may be specified exactly and, in that sense, is fundamentally different from normal human perception.
The use of ITFS to separate EM from IM is based on the premise that listeners extracting speech in a mixture of sounds likely rely only (or predominantly) on the subset of T-F units in which the energy of the target source relative to the energy of the masking source(s) exceeds a specified value (termed the level criterion, or LC). The T-F units in which masker energy dominates target energy logically fall under the definition of EM because of the assumption that the neural response in a small T-F unit would be driven by the properties (i.e., amplitude and timing) of the higher-energy source.
Thus, performing ITFS on a SOS mixture, and using the processed stimulus as the target speech, emulates EM because it eliminates the T-F units dominated by masker energy, which presumably contribute little to target speech intelligibility. Two caveats about ITFS processing are relevant here. First, the extent in time and frequency of the T-F units used in the analysis may not be matched to the limits of the internal representations in the human auditory system. Second, although the assumption that ITFS processing removes the energetically masked T-F units is reasonable and straightforward, ITFS processing also eliminates the IM that is present in those masker-dominated T-F units.
Thus, ITFS processing may achieve the analog of perceptual segregation almost perfectly. The difference between the intelligibility of the “glimpsed” target stimulus (the speech that remains after ITFS processing) and the intelligibility of the masked “natural” target stimulus (prior to ITFS processing), termed “additional masking,” is therefore an indicator of the IM caused by the masker.
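The selection rule described above can be sketched as an ideal binary mask. This is a minimal illustration rather than the authors' implementation: the T-F grids are tiny hypothetical magnitude arrays, and the level criterion (LC) defaults to 0 dB.

```python
import numpy as np

def ideal_binary_mask(target_tf, masker_tf, lc_db=0.0):
    """Return a binary mask keeping T-F units where the target-to-masker
    energy ratio exceeds the level criterion (LC, in dB)."""
    eps = 1e-12  # small floor to avoid log of zero
    snr_db = 10 * np.log10((np.abs(target_tf) ** 2 + eps) /
                           (np.abs(masker_tf) ** 2 + eps))
    return (snr_db > lc_db).astype(float)

# Toy 2x3 T-F magnitude grids (hypothetical values, not real speech).
target = np.array([[1.0, 0.2, 0.8],
                   [0.1, 0.9, 0.3]])
masker = np.array([[0.5, 0.6, 0.1],
                   [0.4, 0.2, 0.3]])

mask = ideal_binary_mask(target, masker, lc_db=0.0)
glimpsed = mask * target  # "glimpsed" stimulus: target-dominated units only
```

Resynthesizing an audible glimpsed waveform would additionally require inverting the T-F analysis (e.g., an inverse STFT), which is omitted here.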
In the present study, performance in SOS masking conditions was measured in a group of young adult listeners with SNHL for conditions in which the stimulus was presented naturally and for the same conditions following ITFS processing. Furthermore, these two processing approaches (termed “natural” and “glimpsed,” respectively) were applied to several types of maskers and in a variety of presentation conditions including a high IM baseline, three conditions in which explicit source segregation cues were provided, and a noise masker condition that was intended as a low IM control.
These stimuli and presentation conditions allowed us to address the following sets of questions:
- First, do listeners with SNHL exhibit greater than normal EM and/or IM in SOS masking conditions? And, conversely, how does the benefit obtained from stimulus manipulations intended to perceptually segregate target speech from masker speech in listeners with SNHL compare to the corresponding benefit found for listeners with NH?
- Second, is the release from SOS masking due to various source segregation cues correlated among individuals with SNHL? Recent work indicates that listeners with NH tend to be grouped according to their general ability to use source segregation cues to overcome IM. Would this same pattern be observed for listeners with SNHL? A related question is whether the release from masking observed for individual listeners with SNHL depends on the degree of hearing loss. These questions will be addressed using a correlational approach that parallels that reported by Kidd et al. (2016).
- Third, does a constant level of speech identification performance (i.e., at the 50%-correct “threshold” point) correspond to the availability of a constant proportion of target energy in the glimpsed stimulus? Because ITFS processing essentially removes the effect of the masker (any masker), it is plausible that all that matters in the glimpsed stimulus is how much of the target remains. If so, a constant level of target identification performance would be directly related to a constant proportion of the target information retained after processing.
If that is not the case across listeners and/or conditions, it would suggest that other factors influence the results. Such factors could include insufficient audibility of the stimulus due to inadequate amplification and/or a reduced ability of the listener with SNHL to form a coherent stream of speech from the sparse “glimpses” of the target that are available in the ITFS-processed/reconstructed speech. Comparison of glimpsed target speech intelligibility across listeners and maskers may help to answer this question.
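The proportion of target energy surviving ITFS processing, referred to above, can be quantified directly when the target T-F representation and the binary mask are known. A minimal sketch with hypothetical values:

```python
import numpy as np

def retained_target_fraction(target_tf, mask):
    """Proportion of total target energy surviving ITFS processing."""
    energy = np.abs(target_tf) ** 2
    return float((energy * mask).sum() / energy.sum())

# Toy example: a binary mask that keeps three of six T-F units.
target = np.array([[1.0, 0.2, 0.8],
                   [0.1, 0.9, 0.3]])
mask = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
fraction = retained_target_fraction(target, mask)
```

Comparing this fraction at the 50%-correct point across listeners and maskers is one way to test whether constant performance reflects a constant retained proportion of the target.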
Participants
Twelve adult listeners with bilateral SNHL, ranging in age from 18 to 40 years, participated. Ten listeners completed the main portion of the study, while eight listeners (six of the original ten plus two additional listeners) completed a subsequent condition using a noise masker.
The correlation between age and hearing loss (defined as the four-frequency pure-tone average at 500, 1000, 2000, and 4000 Hz) was not significant (r = 0.13, one-tailed p = 0.34). As a group, the hearing losses were bilaterally symmetric and sloped at about 10 dB/octave over the range of frequencies tested. All subjects received compensation for their participation in the experiments. For the listeners with SNHL, the stimuli were amplified with individualized linear gain according to the NAL-RP prescriptive formula, determined separately for each ear. Levels are specified prior to the application of gain.
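For concreteness, the four-frequency pure-tone average used here can be computed as follows; the audiogram values are hypothetical, chosen to match the roughly 10 dB/octave group slope described above:

```python
def pta4(thresholds_db):
    """4-frequency pure-tone average (500, 1000, 2000, 4000 Hz), in dB HL."""
    return sum(thresholds_db[f] for f in (500, 1000, 2000, 4000)) / 4

# Hypothetical audiogram for one ear (dB HL), sloping ~10 dB/octave.
audiogram = {250: 20, 500: 30, 1000: 40, 2000: 50, 4000: 60, 8000: 70}
pta = pta4(audiogram)  # → 45.0
```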

Figure. Group mean audiometric thresholds and standard errors of the means for the subjects with sensorineural hearing loss.
Speech Materials
The speech materials consisted of a set of 40 monosyllabic words divided into five syntactic categories: <name> <verb> <number> <adjective> <object> with eight exemplars in each category. The speech targets and maskers were the same as those that were used in Kidd et al. (2016). The words were spoken by 16 young-adult talkers with an equal number of males and females. The average fundamental frequency of the female talkers was 201 Hz (s.d. 20 Hz) and of the male talkers was 109 Hz (s.d. 12 Hz). The recordings were produced by Sensimetrics Corporation (Malden, MA). All talkers recorded all words.
The words were spoken individually so that they could be concatenated in any order without acoustic differences due to across-word coarticulation. Sentences assembled from these words served both as the “target” to be identified and as the speech “maskers” to be ignored. The target sentence was designated by the name “Sue,” which was always the first word of that sentence and was not scored. The remaining four words in the target sentence were chosen at random from the eight exemplars in each of the other four syntactic categories (after names) and were scored. An example is “Sue found three red shoes.”
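The sentence construction described above can be sketched as follows. The corpus below is hypothetical apart from the words in the quoted example; the actual corpus contains eight exemplars per category.

```python
import random

# Hypothetical exemplars (the real corpus has eight per category).
CORPUS = {
    "name":      ["Sue", "Bob", "Jill", "Pat"],
    "verb":      ["found", "gave", "held", "took"],
    "number":    ["two", "three", "four", "five"],
    "adjective": ["red", "big", "old", "new"],
    "object":    ["shoes", "cards", "toys", "hats"],
}

def make_target_sentence(rng=random):
    """Target sentences always begin with the name 'Sue' (not scored);
    the remaining four words are drawn at random, one per category."""
    words = ["Sue"] + [rng.choice(CORPUS[c])
                       for c in ("verb", "number", "adjective", "object")]
    return " ".join(words)

sentence = make_target_sentence()
```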
An 8 × 5 grid of the corpus of test words was displayed on the subject's monitor. The subjects were instructed to mouse-click the five words comprising the target sentence in order, one from each column from left to right. The noise maskers were speech-spectrum-shaped (based on the measured long-term average spectrum of our group of female talkers) and speech-envelope-modulated (using the envelopes of the recorded words as spoken in the speech masker conditions).
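One common way to generate such a masker, sketched here under the assumption of an FFT-based shaping step (the authors' exact procedure is not specified), is to impose a target magnitude spectrum, such as the long-term average speech spectrum, on Gaussian noise and then multiply by a speech-derived envelope:

```python
import numpy as np

def shaped_modulated_noise(spectrum_mag, envelope, rng=None):
    """Noise with a specified magnitude spectrum (e.g., the long-term
    average speech spectrum) and amplitude envelope (e.g., from speech).
    `spectrum_mag` must have len(envelope) // 2 + 1 bins (rfft layout)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(envelope)
    noise = rng.standard_normal(n)
    spec = np.fft.rfft(noise)
    # Keep the random phase; replace the magnitude with the target spectrum.
    spec = spectrum_mag * np.exp(1j * np.angle(spec))
    shaped = np.fft.irfft(spec, n)
    return shaped * envelope
```

In practice the envelope would be extracted from the recorded masker words (e.g., by low-pass filtering the rectified waveform), a step omitted from this sketch.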