The McGurk Effect: How Visual Cues Influence Speech Perception

The McGurk effect, first reported by McGurk and MacDonald in 1976, is a striking multisensory illusion that occurs with audiovisual speech. It demonstrates how visual information can alter our perception of auditory speech, leading to a unified, integrated percept.

In the original experiment, researchers recorded a voice articulating a consonant and then dubbed it with a face articulating a different consonant. Even when the acoustic speech signal was easily recognized on its own, participants perceived a different consonant when it was paired with incongruent visual speech. This phenomenon, known as the McGurk effect, highlights the brain's ability to integrate information from multiple senses to create a coherent perception.

The McGurk effect has been replicated extensively and has inspired a wealth of research due to its powerful demonstration of multisensory integration. It shows that auditory and visual information is merged into a unified, integrated percept.

[Figure: Diagram of the McGurk effect.]

Defining the McGurk Effect

The McGurk effect can be defined as a categorical change in auditory perception induced by incongruent visual speech, resulting in a single percept of hearing something other than what the voice is saying. There are many variants of the McGurk effect. The best-known case is when dubbing a voice saying [b] onto a face articulating [g] results in hearing [d]. This is called the fusion effect since the percept differs from the acoustic and visual components.

Many researchers have defined the McGurk effect exclusively as the fusion effect, because here integration results in the perception of a third consonant, clearly merging information from audition and vision. This definition ignores the fact that other incongruent audiovisual stimuli produce different types of percepts. For example, the reverse combination of these consonants, A[g]V[b], is heard as [bg], i.e., the visual and auditory components one after the other.

There are other pairings that result in hearing according to the visual component; e.g., acoustic [b] presented with visual [d] is heard as [d]. The definition of the McGurk effect should therefore be that an acoustic utterance is heard as another utterance when presented with discrepant visual articulation. This definition includes all variants of the illusion, and it has been used by MacDonald and McGurk (1978) themselves, as well as by several others.

The different variants of the McGurk effect represent the outcome of audiovisual integration. When integration takes place, it results in a unified percept, without access to the individual components that contributed to the percept.

Interpreting the McGurk Effect

One challenge with this interpretation of the McGurk effect is that it is impossible to be certain that the responses the observer gives correspond to the actual percepts. The real McGurk effect arises due to multisensory integration, resulting in an altered auditory percept. However, if integration does not occur, the observer can perceive the components separately and may choose to respond either according to what he heard or according to what he saw.

This is one reason why the fusion effect is so attractive: if the observer reports a percept that differs from both stimulus components, he does not seem to rely on either modality alone, but instead really fuses the information from both. The perception of the acoustic and visual stimulus components has to be taken into account when interpreting the McGurk effect.

In general, the strength of the McGurk effect is taken to increase when the proportion of responses according to the acoustic component decreases and/or when the proportion of fusion responses increases. That is, the McGurk effect for stimulus A[b]V[g] is considered stronger when fewer B responses and/or more D responses are given. This is often an adequate way to measure the strength of the McGurk effect, provided one keeps in mind that it implicitly assumes that perception of the acoustic and visual components is accurate (or at least constant across the conditions that are compared).
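
As a minimal sketch of this bookkeeping in R (the language used for the modeling later in this piece), with purely hypothetical response counts:

    # Hypothetical response counts for an A[b]V[g] stimulus (illustrative only).
    responses <- c(B = 12, D = 70, G = 8, other = 10)
    prop_auditory <- responses[["B"]] / sum(responses)  # lower  -> stronger effect
    prop_fusion   <- responses[["D"]] / sum(responses)  # higher -> stronger effect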

The fusion effect provides a prime example of this caveat. It has been interpreted to mean that acoustic and visual information is integrated to produce a novel, intermediate percept. For example, when A[b]V[g] is heard as [d], the percept is thought to emerge from fusion of the place-of-articulation features provided via audition (bilabial) and vision (velar), so that a different, intermediate consonant (alveolar) is perceived.

However, McGurk and MacDonald (1976) themselves already noted that “lip movements for [ga] are frequently misread as [da],” although, unfortunately, they did not measure speechreading performance. The omission of the unisensory visual condition in the original study is one factor that has contributed to the strong status of the fusion effect as the only real McGurk effect, reflecting true integration.

Examples of Research on the McGurk Effect

To demonstrate the contribution of the unisensory components more explicitly, let us consider two studies in which fusion-type stimuli produced different percepts depending on the clarity of the visual component.

  • In one study, a McGurk stimulus A[epe]V[eke] was mainly heard as a fusion [ete] (Tiippana et al., 2004). This reflected the fact that in a visual-only identification task, the visual [eke] was confused with [ete] (42% K responses and 45% T responses to visual [eke]).
  • In another study, a McGurk stimulus A[apa]V[aka] was mainly heard as [aka], and this could be traced back to the fact that in a visual-only identification task, the visual [aka] was clearly distinguishable from [ata], and thus recognized very accurately (100% correct in typical adults; Saalasti et al., 2012; but note the deviant behavior of individuals with Asperger syndrome).

Thus, even though the McGurk stimuli were of a fusion type in both studies, their perception differed depending largely on the clarity of the visual components.

Exactly how to take the properties of the unisensory components into account in multisensory speech perception is beyond the scope of this paper. Addressing this issue in detail requires carefully designed experimental studies, computational modeling, and investigation of the underlying brain mechanisms.

[Video: McGurk effect demonstration and explanation.]

The McGurk Effect and Multisensory Integration

During experiments, when the task is to report what was heard, the observer reports the conscious auditory percept evoked by the audiovisual stimulus. If there is no multisensory integration or interaction, the percept is identical for the audiovisual stimulus and the auditory component presented alone. If there is audiovisual integration, the conscious auditory percept changes.

The extent to which visual input influences the percept depends on how coherent and reliable the information provided by each modality is. This perceptual process is the same for all audiovisual speech, be it natural, congruent audiovisual speech or artificial, incongruent McGurk stimuli. The outcome is the conscious auditory percept.

Depending on the relative weighting of audition and vision, the outcome for McGurk stimuli can range from hearing according to the acoustic component (when audition is more reliable than vision) to fusion and combination percepts (when both modalities are informative to some extent) to hearing according to the visual component (when vision is more reliable than audition).

Congruent audiovisual speech is treated no differently, showing visual influence when the auditory reliability decreases. The McGurk effect is an excellent tool to investigate multisensory integration in speech perception.

Causal Inference Model

A causal inference model can explain perception of the McGurk effect and other incongruent audiovisual speech. During face-to-face conversations, we seamlessly integrate information from the talker’s voice with information from the talker’s face. This multisensory integration increases speech perception accuracy and can be critical for understanding speech in noisy environments with many people talking simultaneously. A major challenge for models of multisensory speech perception is thus deciding which voices and faces should be integrated.

Our solution to this problem is based on the idea of causal inference: given a particular pair of auditory and visual syllables, the brain calculates the likelihood that they come from a single talker vs. multiple talkers and uses this likelihood to determine the final speech percept. We compared our model with an alternative model that is identical except that it always integrates the available cues. Using behavioral speech perception data from a large number of subjects, the model with causal inference better predicted how humans would (or would not) integrate audiovisual speech syllables.

Methods

In everyday environments, we encounter audiovisual speech and must decide whether the auditory and visual components of the speech emanate from a single talker (C = 1) or two separate talkers (C = 2; Fig 1A). Most studies of multisensory integration assume that C = 1 and focus on the details of the inference used to produce a single multisensory representation that is then categorized as a particular percept.

To carry out causal inference, we must perform the additional steps of calculating the C = 2 representation and then combining the C = 1 and C = 2 representations. This combined representation is then categorized as a particular percept.

We define a two-dimensional space, the minimum dimensionality needed to characterize auditory and visual speech information (Fig 1B). The x-axis represents auditory information in the stimulus while the y-axis represents visual information in the stimulus. For simplicity, we model a space containing only three speech token categories: “ba,” “da,” and “ga”.

Based on behavioral confusability studies and previous modeling work, “da” was placed intermediate to “ba” and “ga” on both the auditory and visual axes, slightly closer to “ba” on the auditory x-axis, and slightly closer to “ga” on the visual y-axis. These syllable locations can be thought of as prototypes, with different talkers (or different utterances from the same talker) differing from the prototype.
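
To make this concrete, the prototype layout might be sketched as follows in R (the language used for the simulations below); the coordinates are illustrative assumptions, not the fitted values.

    # Illustrative prototype locations in the 2-D (auditory, visual) space.
    # "da" lies between "ba" and "ga": slightly closer to "ba" on the
    # auditory x-axis, slightly closer to "ga" on the visual y-axis.
    prototypes <- rbind(
      ba = c(auditory = 0.0, visual = 0.0),
      da = c(auditory = 0.4, visual = 0.6),
      ga = c(auditory = 1.0, visual = 1.0)
    )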

To model these category distributions as simply as possible, we defined two-dimensional variance-covariance matrices (identical across syllables) with zero covariance (information along the auditory and visual axes is uncorrelated) and equal variances along each axis. A staple of Bayesian models of perception is the concept of sensory noise: not only do individual exemplar stimuli vary from their prototype, but the perceived stimulus also varies from its actual physical properties due to sensory noise. We model this as two-dimensional variance-covariance matrices representing Gaussian noise in each modality (ΣA and ΣV, for auditory and visual encoding) with variances inversely proportional to the precision of the modality.

Modalities are encoded separately, but through extensive experience with audiovisual speech, encoding a unisensory speech stimulus provides some information about the other modality. For instance, hearing a unisensory auditory “ba” informs the observer that the mouth of the talker must have been in an initially lips-closed position. For such a unisensory cue, the information provided about the other sensory modality has higher variance.

In our model, we assume that for auditory cues, the standard deviation along the visual axis is 1.5 times larger than the standard deviation along the auditory axis. For visual cues, the standard deviation along the auditory axis is 1.5 times larger than the standard deviation along the visual axis.
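
Continuing the sketch, the covariance structure just described could be written as below; the base standard deviations are placeholder values, not fitted parameters.

    # Category variability: identical across syllables, zero covariance,
    # equal variances along both axes (sigma_cat is an assumed placeholder).
    sigma_cat <- 0.3
    Sigma_cat <- diag(c(sigma_cat^2, sigma_cat^2))

    # Sensory noise: each modality is less precise about the other modality's
    # axis (standard deviation 1.5x larger). Base SDs are placeholders.
    sigma_A <- 0.2
    sigma_V <- 0.2
    Sigma_A <- diag(c(sigma_A^2, (1.5 * sigma_A)^2))  # (auditory, visual) axes
    Sigma_V <- diag(c((1.5 * sigma_V)^2, sigma_V^2))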

For each presentation of a given audiovisual stimulus, the model encodes each modality separately. For a single trial of a stimulus with auditory component SA and visual component SV, the model generates two vectors: the auditory representation XA ~ N(SA, ΣA) and the visual representation XV ~ N(SV, ΣV), where N(μ, Σ) is a normal distribution with mean μ and variance Σ.
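
In code, one trial’s encoding could look like this, continuing the sketch and sampling with rmvnorm from the mvtnorm package mentioned below:

    library(mvtnorm)

    # Noisy single-trial encodings of an A[ba]V[ga] McGurk stimulus.
    S_A <- prototypes["ba", ]  # auditory component
    S_V <- prototypes["ga", ]  # visual component
    X_A <- as.vector(rmvnorm(1, mean = S_A, sigma = Sigma_A))
    X_V <- as.vector(rmvnorm(1, mean = S_V, sigma = Sigma_V))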

To form the C = 1 representation, the model assumes Bayesian inference (integration of auditory and visual speech cues according to their reliabilities). We use the two-dimensional analog of the common Bayesian cue-integration rule as described by [2]. On each trial, we calculate the integrated representation as XAV = ΣAV(ΣA⁻¹XA + ΣV⁻¹XV), where ΣAV = (ΣA⁻¹ + ΣV⁻¹)⁻¹.
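
The same rule in code, continuing the sketch:

    # Reliability-weighted integration: each modality's encoding is weighted
    # by its precision (inverse covariance); the combined precision gives Sigma_AV.
    Sigma_AV <- solve(solve(Sigma_A) + solve(Sigma_V))
    X_AV <- as.vector(Sigma_AV %*% (solve(Sigma_A) %*% X_A +
                                    solve(Sigma_V) %*% X_V))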

Without causal inference, the integrated representation, XAV, is the final representation. Although the representational space is continuous, speech perception is categorical. Therefore, to produce a categorical percept, we determine the syllable that is most likely to have generated the integrated representation: Ŝ = argmax_i N(XAV; μi, Σi + ΣAV), where N(x; μ, Σ) is the two-dimensional Gaussian density function, μi is the two-dimensional location of a particular syllable category, and Σi + ΣAV is the sum of the category’s variance-covariance matrix and the variance of XAV.
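
The categorization step might look like this, continuing the sketch (dmvnorm is the mvtnorm density function):

    # Choose the syllable whose Gaussian density is highest at X_AV, adding
    # the representation's own uncertainty (Sigma_AV) to each category's spread.
    likelihoods <- sapply(rownames(prototypes), function(s)
      dmvnorm(X_AV, mean = prototypes[s, ], sigma = Sigma_cat + Sigma_AV))
    percept <- names(which.max(likelihoods))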

All model simulations were done in R [15]; multivariate probabilities were calculated using the mvtnorm package [16, 17]. In the CIMS model, rather than assuming that C = 1, we take both the C = 1 and the C = 2 representations into consideration, weighting them by their likelihood. For C = 1, the representation is the same as in the non-CIMS model (Fig 2A). For C = 2, the representation is simply the encoded representation of the auditory portion of the stimulus; this is reasonable because most incongruent pairings of auditory and visual speech result in perception of the auditory syllable.
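
A minimal sketch of that weighting step, under the assumption that the probability of a common cause falls off with the disparity between the two encodings (the exact mapping used in the model may differ):

    # Approximate P(C = 1) from the disparity between the auditory and visual
    # encodings; the Gaussian falloff here is an illustrative assumption.
    disparity <- sqrt(sum((X_A - X_V)^2))
    p_common  <- exp(-disparity^2 / 2)

    # Final representation: mix the integrated (C = 1) and auditory-only
    # (C = 2) representations by the inferred probability of a common cause.
    X_final <- p_common * X_AV + (1 - p_common) * X_A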

Demonstrating the McGurk Effect

The McGurk effect (named after Harry McGurk of McGurk & MacDonald, 1976) is a compelling demonstration of how we all use visual speech information. The effect shows that we can't help but integrate visual speech into what we 'hear'.

Instructions:

  1. You will see and hear a mouth speaking four syllables. Watch the mouth closely, but concentrate on what you're hearing. The movie will repeat itself until it is stopped. You can watch it as many times as you need to be sure of the syllables you hear. After you feel certain of what you perceive, stop the movie and continue reading the text below. PLEASE START THE MOVIE NOW.
  2. Now start the movie again and close your eyes. Listen to the movie repeat until you are sure of what you hear. When you feel certain of what you hear, stop the movie and continue reading the text below. PLEASE RESTART THE MOVIE NOW.

If you're like most people, what you hear depends on whether your eyes are open or closed. If you'd like, you can start the movie again and, as it repeats, switch between opening and closing your eyes. Your experience of what you hear should change. After you're convinced, stop the movie and read the explanation below. You can always play the movie again later.

How the stimuli were made:

These stimuli were made by dubbing a single repeated audio syllable onto four different visual syllables. Depending on the audiovisual syllable combination used:

  • the visual syllable can override the auditory syllable to determine what we perceive
  • the auditory and visual syllables can combine to produce a new perceived syllable
  • the auditory syllable can override the visual syllable to determine what we perceive

What the effect means:

The McGurk effect shows that visual articulatory information is integrated into our perception of speech automatically and unconsciously. The syllable that we perceive depends on the strength of the auditory and visual information, and whether some compromise can be achieved. Regardless, integration of the discrepant audiovisual speech syllables is effortless and mandatory. Our speech function makes use of all types of relevant information, regardless of the modality. In fact, there is some evidence that the brain treats visual speech information as if it is auditory speech.

How general is the McGurk effect?

  • The effect works on perceivers with all language backgrounds
  • The effect works on young infants
  • The effect works when the visual and auditory components are from speakers of different genders
  • The effect works with highly reduced face images
  • The effect works when observers are unaware that they are looking at a face
  • The effect works when observers touch, rather than look at, the face
  • The effect works less well with vowels than consonants
  • The effect works less well with nonspeech pluck & bow stimuli
  • The effect works better with some consonant combinations than others (e.g., McGurk & MacDonald, 1976).

To produce a 'live' demonstration of the McGurk effect (you'll need two other people besides yourself):

  1. have an observer face you and keep looking at your face
  2. have another person stand behind you so the observer can't see their face
  3. starting synchronously, repeatedly mouth the word 'vase' (silently) while the person behind you repeats the word 'base' out loud; you can achieve synchronization by counting down '3, 2, 1... vase, vase, vase,' etc.
  4. after about 8 repetitions, stop and ask the observer what they 'hear'; they should 'hear' 'vase'
  5. now do the same thing, and this time tell the observer to shut their eyes after a few repetitions
  6. they should hear 'base' with their eyes shut
  7. the observer can try opening and shutting their eyes, and what they 'hear' should change from 'vase' to 'base'

Some tips on making your own McGurk Stimuli:

Audiovisual dubbing can be achieved by using two videotape players or digitizing stimuli onto a computer and using software to mix the audio and video components. The quality of the auditory channel should be good, but the quality of the visual channel can be fair without much loss in the effect. The auditory and visual components should be synchronized so that the sound of the syllable seems to be coming from the visible mouth. However, the components do not have to be perfectly synchronized for the effect to work. The syllable combinations used in the above demonstration are known to be especially strong.