The McGurk Effect: How Seeing Lips Can Change What You Hear
The McGurk effect is a perceptual phenomenon that demonstrates an interaction between hearing and vision in speech perception. It illustrates that what we hear is not based solely on auditory information, but is also influenced by visual cues, such as lip movements. This effect provides valuable insights into how our brains integrate information from different senses to create a unified perception of the world.
McGurk and MacDonald (1976) reported a powerful multisensory illusion that occurs with audiovisual speech. Even though the acoustic speech signal was recognized accurately when presented alone, it was heard as a different consonant after being dubbed onto incongruent visual speech. This illusion has been termed the McGurk effect. It has been replicated many times and has sparked an abundance of research, because it is a striking demonstration of multisensory integration: auditory and visual information is merged into a unified, integrated percept.
The McGurk effect (named after Harry McGurk of McGurk & MacDonald, 1976) is a compelling demonstration of how we all use visual speech information. The effect shows that we can't help but integrate visual speech into what we 'hear'.

Try the demonstration now, and then read about how the stimuli were made, what the effect means, and how to produce your own McGurk effects.
Instructions: You will see and hear a mouth speaking four syllables. Watch the mouth closely, but concentrate on what you're hearing. The movie will repeat itself until it is stopped. You can watch it as many times as you need to be sure of the syllables you hear. After you feel certain of what you perceive, stop the movie and continue reading the text below.
Now start the movie again and close your eyes. Listen to the movie repeat until you are sure of what you hear. When you feel certain of what you hear, stop the movie and continue reading the text below.
If you're like most people, what you hear depends on whether your eyes are open or closed. If you'd like, you can start the movie again, and as it repeats, switch between opening and closing your eyes. Your experience of what you hear should change. After you're convinced, stop the movie and read the explanation below. You can always play the movie again later.
Defining the McGurk Effect
Here I shall make two main claims regarding the definition and interpretation of the McGurk effect, since they bear on its use as a measure of multisensory integration. First, the McGurk effect should be defined as a categorical change in auditory perception induced by incongruent visual speech, resulting in a single percept of hearing something other than what the voice is saying. There are many variants of the McGurk effect (McGurk and MacDonald, 1976; MacDonald and McGurk, 1978). The best-known case is when dubbing a voice saying [b] onto a face articulating [g] results in hearing [d]. This is called the fusion effect, since the percept differs from both the acoustic and the visual component.
Many researchers have defined the McGurk effect exclusively as the fusion effect, because here integration results in the perception of a third consonant, obviously merging information from audition and vision (van Wassenhove et al., 2007; Keil et al., 2012; Setti et al., 2013). This definition ignores the fact that other incongruent audiovisual stimuli produce different types of percepts. For example, the reverse combination of these consonants, A[g]V[b], is heard as [bg], i.e., the visual and auditory components one after the other. There are other pairings that result in hearing according to the visual component; e.g., acoustic [b] presented with visual [d] is heard as [d].
Here my first claim is that the McGurk effect should be defined as follows: an acoustic utterance is heard as another utterance when presented with discrepant visual articulation. This definition includes all variants of the illusion, and it has been used by MacDonald and McGurk (1978) themselves, as well as by several others (e.g., Rosenblum and Saldaña, 1996; Brancazio et al., 2003).
Variants of the McGurk Effect
The different variants of the McGurk effect represent the outcome of audiovisual integration. When integration takes place, it results in a unified percept, without access to the individual components that contributed to the percept.
- Fusion: The perceived sound is a blend of the auditory and visual inputs (e.g., hearing "da" when the auditory input is "ba" and the visual input is "ga").
- Combination: The perceived sound combines elements of both the auditory and visual inputs (e.g., hearing "bg" when the auditory input is "ga" and the visual input is "ba").
- Visual Capture: The perceived sound matches the visual input, overriding the auditory input (e.g., hearing "ga" when the auditory input is "ba" and the visual input is "ga").
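The three variants above can be summarized as a small lookup table. The sketch below is only illustrative: the syllable pairs and percepts are taken from the examples in the text, and real percepts vary across stimuli and observers.

```python
# Illustrative lookup of McGurk variants described in the text.
# Keys are (auditory, visual) syllable pairs; values are the reported
# percept and the variant label. This is a toy summary, not a
# predictive model of perception.
MCGURK_VARIANTS = {
    ("ba", "ga"): ("da", "fusion"),          # percept differs from both inputs
    ("ga", "ba"): ("bg", "combination"),     # both components heard in sequence
    ("ba", "da"): ("da", "visual capture"),  # percept follows the visual input
}

def classify(auditory, visual):
    """Return (percept, variant) for a known pair, else the auditory syllable."""
    return MCGURK_VARIANTS.get((auditory, visual), (auditory, "no illusion"))
```

For congruent input such as `classify("ba", "ba")`, the function simply returns the auditory syllable, mirroring the point that integration only alters the percept when the components are discrepant.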
Interpreting the McGurk Effect
One challenge with this interpretation of the McGurk effect is that it is impossible to be certain that the responses the observer gives correspond to the actual percepts. The real McGurk effect arises through multisensory integration, resulting in an altered auditory percept. However, if integration does not occur, the observer can perceive the components separately and may choose to respond either according to what he heard or according to what he saw. This is one reason why the fusion effect is so attractive: if the observer reports a percept that differs from both stimulus components, he does not seem to rely on either modality alone, but instead really fuses the information from both.
The second main claim here is that the perception of the acoustic and visual stimulus components has to be taken into account when interpreting the McGurk effect. This issue has been elaborated previously in the extensive work by Massaro and colleagues (Massaro, 1998) and others (Sekiyama and Tohkura, 1991; Green and Norrix, 1997; Jiang and Bernstein, 2011). In general, the strength of the McGurk effect is taken to increase when the proportion of responses according to the acoustic component decreases and/or when the proportion of fusion responses increases. That is, the McGurk effect for stimulus A[b]V[g] is considered stronger when fewer B responses and/or more D responses are given. This is often an adequate way to measure the strength of the McGurk effect, provided one keeps in mind that it implicitly assumes that perception of the acoustic and visual components is accurate (or at least constant across the conditions being compared).
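To make this scoring logic concrete, here is a hypothetical sketch: the response counts are invented, and "strength" is operationalized simply as the proportion of responses that do not follow the acoustic component.

```python
# Hypothetical response counts for an A[b]V[g] McGurk stimulus.
# Counts are invented for illustration; real data would come from
# a perception experiment with many trials per observer.
responses = {"b": 3, "d": 15, "g": 1, "bg": 1}

total = sum(responses.values())

# Per the text, the effect is taken to be stronger when fewer responses
# follow the acoustic component ("b") and/or more fusion responses ("d")
# are given.
auditory_proportion = responses["b"] / total
fusion_proportion = responses["d"] / total
mcgurk_strength = 1 - auditory_proportion   # proportion of non-auditory responses

print(f"fusion: {fusion_proportion:.2f}, strength: {mcgurk_strength:.2f}")
```

Note the caveat from the text applies here too: this score is only interpretable if unisensory perception of the components is accurate, or at least constant across the conditions being compared.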
The fusion effect provides a prime example of this caveat. It has been interpreted to mean that acoustic and visual information is integrated to produce a novel, intermediate percept. For example, when A[b]V[g] is heard as [d], the percept is thought to emerge through fusion of the place-of-articulation features provided by audition (bilabial) and vision (velar), so that a different, intermediate consonant (alveolar) is perceived (van Wassenhove, 2013). However, McGurk and MacDonald (1976) themselves already noted that “lip movements for [ga] are frequently misread as [da],” even though, unfortunately, they did not measure speechreading performance.
The omission of the unisensory visual condition in the original study is one factor that has contributed to the strong status of the fusion effect as the only real McGurk effect, reflecting true integration. To demonstrate the contribution of the unisensory components more explicitly, I will take two examples from my own research, in which fusion-type stimuli produced different percepts depending on the clarity of the visual component. In one study, a McGurk stimulus A[epe]V[eke] was mainly heard as a fusion [ete] (Tiippana et al., 2004). This reflected the fact that in a visual-only identification task, the visual [eke] was confused with [ete] (42% K responses and 45% T responses to visual [eke]). In another study, a McGurk stimulus A[apa]V[aka] was mainly heard as [aka], and this could be traced back to the fact that in a visual-only identification task, the visual [aka] was clearly distinguishable from [ata], and thus recognized very accurately (100% correct in typical adults; Saalasti et al., 2012; but note the deviant behavior of individuals with Asperger syndrome). Thus, even though the McGurk stimuli were of a fusion type in both studies, their perception differed largely depending on the clarity of the visual components.
| Study | McGurk Stimulus | Main Percept | Visual-Only Identification |
|---|---|---|---|
| Tiippana et al. (2004) | A[epe]V[eke] | [ete] (fusion) | [eke] confused with [ete] (42% K, 45% T) |
| Saalasti et al. (2012) | A[apa]V[aka] | [aka] | [aka] clearly distinguishable from [ata] (100% correct) |
Factors Influencing the McGurk Effect
Several factors can influence the strength and occurrence of the McGurk effect:
- Clarity of Visual Input: Clearer visual articulation leads to a stronger visual influence on perception.
- Reliability of Auditory Input: The more ambiguous or degraded the auditory signal, the more likely vision will dominate.
- Individual Differences: Susceptibility to the McGurk effect can vary based on individual perceptual abilities and cognitive factors.
- Attention: Attentional focus on visual or auditory cues can modulate the effect.
Exactly how to take the properties of the unisensory components into account in multisensory perception of speech is beyond the scope of this paper. Addressing this issue in detail requires carefully designed experimental studies (Bertelson et al., 2003; Alsius et al., 2005), computational modeling (Massaro, 1998; Schwartz, 2010), and investigation of the underlying brain mechanisms (Sams et al., 1991; Skipper et al., 2007).
The McGurk Effect as a Tool for Research
The McGurk effect is an excellent tool for investigating multisensory integration in speech perception. During experiments, when the task is to report what was heard, the observer reports the conscious auditory percept evoked by the audiovisual stimulus. If there is no multisensory integration or interaction, the percept is identical for the audiovisual stimulus and the auditory component presented alone. If there is audiovisual integration, the conscious auditory percept changes. The extent to which visual input influences the percept depends on how coherent and reliable the information provided by each modality is.
This perceptual process is the same for all audiovisual speech, whether natural, congruent audiovisual speech or artificial, incongruent McGurk stimuli. The outcome is the conscious auditory percept. Depending on the relative weighting of audition and vision, the outcome for McGurk stimuli can range from hearing according to the acoustic component (when audition is more reliable than vision), to fusion and combination percepts (when both modalities are informative to some extent), to hearing according to the visual component (when vision is more reliable than audition). Congruent audiovisual speech is treated no differently, showing visual influence when auditory reliability decreases.
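A minimal sketch of this reliability-weighting idea, loosely in the spirit of the multiplicative integration models cited above (e.g., Massaro's fuzzy logical model of perception): each modality assigns a degree of support to each response alternative, and the supports are multiplied and renormalized. All numbers below are invented for illustration, not fitted to data.

```python
# Sketch of reliability-weighted audiovisual integration, loosely in
# the spirit of the fuzzy logical model of perception (Massaro, 1998).
# Each modality assigns a support value in [0, 1] to each response
# alternative; supports are multiplied and renormalized. The support
# values are invented for illustration.
auditory_support = {"ba": 0.80, "da": 0.15, "ga": 0.05}  # audition favors "ba"
visual_support   = {"ba": 0.05, "da": 0.45, "ga": 0.50}  # vision favors "ga"/"da"

combined = {k: auditory_support[k] * visual_support[k] for k in auditory_support}
norm = sum(combined.values())
probabilities = {k: v / norm for k, v in combined.items()}

# With these supports, the fused percept "da" wins even though neither
# modality alone favors it most strongly.
best = max(probabilities, key=probabilities.get)
```

Degrading either modality (lowering its supports toward uniform values) shifts the predicted percept toward the other modality's preferred alternative, which mirrors the range of outcomes described above.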
How the stimuli were made
These stimuli were made by dubbing a single repeated audio syllable onto four different visual syllables.
Depending on the audiovisual syllable combination used:
- the visual syllable can override the auditory syllable to determine what we perceive
- the auditory and visual syllables can combine to produce a new perceived syllable
- the auditory syllable can override the visual syllable to determine what we perceive
What the effect means
The McGurk effect shows that visual articulatory information is integrated into our perception of speech automatically and unconsciously. The syllable that we perceive depends on the strength of the auditory and visual information, and on whether some compromise can be achieved. Regardless, integration of the discrepant audiovisual speech syllables is effortless and mandatory. Our speech function makes use of all types of relevant information, regardless of the modality. In fact, there is some evidence that the brain treats visual speech information as if it were auditory speech.
How general is the McGurk effect?
The effect works on perceivers of all language backgrounds (e.g., Massaro, Cohen, Gesi, Heredia, & Tsuzaki, 1993; Sekiyama & Tohkura, 1993).
The effect works on young infants (Rosenblum, Schmuckler, & Johnson, 1997).
The effect works when the visual and auditory components are from speakers of different genders (Green, Kuhl, Meltzoff, & Stevens, 1991).
The effect works with highly reduced face images (Rosenblum & Saldaña, 1996).
The effect works when observers are unaware that they are looking at a face (Rosenblum & Saldaña, 1996).
The effect works when observers touch, rather than look at, the face (Fowler & Dekle, 1991).
The effect works less well with vowels than consonants (Summerfield & McGrath, 1984).
The effect works less well with nonspeech pluck & bow stimuli (Saldaña & Rosenblum, 1994).
The effect works better with some consonant combinations than others (e.g., McGurk & MacDonald, 1976).
To produce a 'live' demonstration of the McGurk effect:
(you'll need two other people besides yourself)
- have an observer face you and keep looking at your face
- have another person stand behind you so the observer can't see their face
- starting synchronously, repeatedly mouth the word 'vase' (silently) while the person behind you repeats the word 'base' out loud; you can achieve synchronization by counting down '3, 2, 1 ... vase, vase, vase', and so on
- after about 8 repetitions, stop and ask the observer what they 'hear'; they should 'hear' 'vase'
- now do the same thing, and this time tell the observer to shut their eyes after a few repetitions
- they should hear 'base' with their eyes shut
- the observer can try opening and shutting their eyes, and what they 'hear' should change from 'vase' to 'base'
Some tips on making your own McGurk Stimuli:
Audiovisual dubbing can be achieved by using two videotape players, or by digitizing the stimuli onto a computer and using software to mix the audio and video components. The quality of the auditory channel should be good, but the quality of the visual channel can be fair without much loss in the effect. The auditory and visual components should be synchronized so that the sound of the syllable seems to be coming from the visible mouth. However, the components do not have to be perfectly synchronized for the effect to work.
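For the digital route, one modern way to mix the components is ffmpeg, which can replace a video's audio track with a separately recorded syllable. The sketch below only builds the command as an argument list; the file names are placeholders, not the actual demonstration stimuli.

```python
# Build an ffmpeg command that dubs a separately recorded audio syllable
# onto a video of a talking face. File names are placeholders invented
# for illustration; run the command yourself once the files exist.
def mcgurk_dub_command(video_in, audio_in, video_out):
    return [
        "ffmpeg",
        "-i", video_in,    # visual syllable (e.g., a face saying "ga")
        "-i", audio_in,    # acoustic syllable (e.g., a voice saying "ba")
        "-map", "0:v:0",   # take the video stream from the first input
        "-map", "1:a:0",   # take the audio stream from the second input
        "-c:v", "copy",    # keep the video stream untouched
        "-shortest",       # stop at the end of the shorter stream
        video_out,
    ]

cmd = mcgurk_dub_command("visual_ga.mp4", "audio_ba.wav", "mcgurk_da.mp4")
# Execute with, e.g., subprocess.run(cmd, check=True).
```

Fine synchronization, if needed, can be adjusted by trimming or delaying the audio before dubbing, though as noted above the alignment does not have to be perfect.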
The syllable combinations used in the above demonstration are known to be especially strong.