The McGurk Effect: How Seeing Lips Affects What You Hear
The McGurk effect is a fascinating perceptual phenomenon that highlights the complex way our brains integrate auditory and visual information when perceiving speech. It demonstrates that what we hear is not based solely on the acoustic signal but is also influenced by what we see, specifically the movements of the speaker's lips and face.
McGurk and MacDonald (1976) first reported this powerful multisensory illusion in audiovisual speech: an acoustic speech signal that was well recognized on its own was heard as another consonant after it was dubbed onto incongruent visual speech. The illusion has since been termed the McGurk effect.

The McGurk effect shows that visual articulatory information is integrated into our perception of speech automatically and unconsciously. The syllable we perceive depends on the relative strength of the auditory and visual information, and on whether some compromise between them can be achieved. In any case, integration of discrepant audiovisual speech syllables is effortless and mandatory: our speech perception makes use of all relevant information, regardless of modality.
Defining the McGurk Effect
First, the McGurk effect should be defined as a categorical change in auditory perception induced by incongruent visual speech: an acoustic utterance is heard as another utterance when presented with discrepant visual articulation, resulting in a single percept of hearing something other than what the voice is saying. This definition includes all variants of the illusion, and it has been used by MacDonald and McGurk (1978) themselves, as well as by several others (e.g., Rosenblum and Saldaña, 1996; Brancazio et al., 2003). The different variants of the McGurk effect represent the outcome of audiovisual integration.
There are many variants of the McGurk effect (McGurk and MacDonald, 1976; MacDonald and McGurk, 1978). The best-known case is when dubbing a voice saying [b] onto a face articulating [g] results in hearing [d]. This is called the fusion effect since the percept differs from the acoustic and visual components. Many researchers have defined the McGurk effect exclusively as the fusion effect because here integration results in the perception of a third consonant, obviously merging information from audition and vision (van Wassenhove et al., 2007; Keil et al., 2012; Setti et al., 2013). This definition ignores the fact that other incongruent audiovisual stimuli produce different types of percepts.
For example, the reverse pairing of these consonants, A[g]V[b], is heard as [bg], i.e., the visual and auditory components one after the other; this is called the combination effect. Other pairings result in hearing according to the visual component, e.g., acoustic [b] presented with visual [d] is heard as [d].
When integration takes place, it results in a unified percept, without access to the individual components that contributed to the percept. If there is no multisensory integration or interaction, the percept is identical for the audiovisual stimulus and the auditory component presented alone. If there is audiovisual integration, the conscious auditory percept changes.
The Role of Unisensory Components
The second main claim here is that the perception of the acoustic and visual stimulus components has to be taken into account when interpreting the McGurk effect. This issue has been elaborated previously in the extensive work by Massaro and colleagues (Massaro, 1998) and others (Sekiyama and Tohkura, 1991; Green and Norrix, 1997; Jiang and Bernstein, 2011).
In general, the strength of the McGurk effect is taken to increase when the proportion of responses according to the acoustic component decreases and/or when the proportion of fusion responses increases. That is, the McGurk effect for stimulus A[b]V[g] is considered stronger when fewer B responses and/or more D responses are given. This is often an adequate way to measure the strength of the McGurk effect, provided one keeps in mind that it implicitly assumes that perception of the acoustic and visual components is accurate (or at least constant across the conditions being compared).
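As a concrete illustration, the strength convention just described can be computed directly from raw response counts. The counts below are hypothetical, chosen only to show the calculation; they are not data from any study.

```python
# Strength of the McGurk effect for a hypothetical A[b]V[g] stimulus,
# following the convention above: fewer responses matching the acoustic
# component (B) and/or more fusion responses (D) indicate a stronger effect.
from collections import Counter

# Hypothetical response counts from 100 presentations of A[b]V[g]
responses = Counter({"b": 12, "d": 78, "g": 10})
n = sum(responses.values())

p_auditory = responses["b"] / n  # proportion hearing the acoustic component
p_fusion = responses["d"] / n    # proportion reporting the fusion percept

print(f"auditory-match rate: {p_auditory:.2f}, fusion rate: {p_fusion:.2f}")
```

Note that these proportions are only interpretable as "integration strength" under the assumption flagged above, namely that the unisensory components would be identified accurately on their own.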
The fusion effect provides a prime example of this caveat. It has been interpreted to mean that acoustic and visual information is integrated to produce a novel, intermediate percept. For example, when A[b]V[g] is heard as [d], the percept is thought to emerge through fusion of the place-of-articulation features provided via audition (bilabial) and vision (velar), so that a different, intermediate consonant (alveolar) is perceived (van Wassenhove, 2013). However, McGurk and MacDonald (1976) themselves already noted that "lip movements for [ga] are frequently misread as [da]," although unfortunately they did not measure speechreading performance. The omission of a unisensory visual condition in the original study is one factor that has contributed to the strong status of the fusion effect as the only real McGurk effect, reflecting true integration.
To demonstrate the contribution of the unisensory components more explicitly, I will take two examples from my own research, in which fusion-type stimuli produced different percepts depending on the clarity of the visual component. In one study, a McGurk stimulus A[epe]V[eke] was mainly heard as a fusion [ete] (Tiippana et al., 2004). This reflected the fact that in a visual-only identification task, the visual [eke] was confused with [ete] (42% K responses and 45% T responses to visual [eke]). In another study, a McGurk stimulus A[apa]V[aka] was mainly heard as [aka], and this could be traced back to the fact that in a visual-only identification task, the visual [aka] was clearly distinguishable from [ata] and thus recognized very accurately (100% correct in typical adults; Saalasti et al., 2012; but note the deviant behavior of individuals with Asperger syndrome). Thus, even though the McGurk stimuli were of a fusion type in both studies, their perception differed depending largely on the clarity of the visual components.
Depending on the relative weighting of audition and vision, the outcome for McGurk stimuli can range from hearing according to the acoustic component (when audition is more reliable than vision) to fusion and combination percepts (when both modalities are informative to some extent) to hearing according to the visual component (when vision is more reliable than audition). Congruent audiovisual speech is treated no differently, showing visual influence when the auditory reliability decreases.
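The reliability-weighted outcome described above can be sketched with a simple multiplicative integration rule in the spirit of the Fuzzy Logical Model of Perception (FLMP; Massaro, 1998), cited earlier: each modality supplies a degree of support for each response alternative, the supports are multiplied, and the products are renormalized. The support values below are illustrative assumptions, not fitted data.

```python
# Minimal sketch of multiplicative audiovisual integration (FLMP-style).
# Support values are hypothetical, chosen to mimic a fusion-type stimulus.

def integrate(auditory: dict, visual: dict) -> dict:
    """Multiply per-alternative supports from each modality and renormalize."""
    raw = {k: auditory[k] * visual[k] for k in auditory}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

# A[b]V[g]: audition strongly supports /b/; vision supports /g/, but a velar
# articulation is often confused with alveolar /d/ in speechreading, so /d/
# also receives substantial visual support.
auditory = {"b": 0.80, "d": 0.15, "g": 0.05}
visual = {"b": 0.05, "d": 0.45, "g": 0.50}

percept = integrate(auditory, visual)
winner = max(percept, key=percept.get)  # /d/, the fusion percept
```

With these illustrative numbers, /d/ receives the highest combined support even though neither modality favors it alone, which is the signature of a fusion response; shifting the support profiles toward one modality instead yields auditory-dominant or visually-dominant percepts, as the paragraph above describes.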
Factors Influencing the McGurk Effect
The McGurk effect is not uniform across all individuals and conditions. Several factors can influence its strength and manifestation:
- Neurological Conditions: Conditions such as Alzheimer’s disease, schizophrenia, autism spectrum disorder, and brain damage can alter an individual's susceptibility to the McGurk effect.
- Attention: The amount of attention one pays to the visual or auditory stimuli can affect the integration process.
- Clarity of Stimuli: The clarity and distinctness of both the auditory and visual components play a crucial role. Ambiguous visual cues may lead to a stronger reliance on auditory information, and vice versa.
- Language Background: While the McGurk effect is generally robust across languages, some studies suggest that specific linguistic experiences can influence its perception.
How General is the McGurk Effect?
The McGurk effect is a widespread and robust phenomenon, demonstrating its generalizability across various conditions and populations:
- The effect works on perceivers of all language backgrounds (e.g., Massaro, Cohen, Gesi, Heredia, & Tsuzaki, 1993; Sekiyama & Tohkura, 1993).
- The effect works on young infants (Rosenblum, Schmuckler, & Johnson, 1997).
- The effect works when the visual and auditory components are from speakers of different genders (Green, Kuhl, Meltzoff, & Stevens, 1991).
- The effect works with highly reduced face images (Rosenblum & Saldaña, 1996).
- The effect works when observers are unaware that they are looking at a face (Rosenblum & Saldaña, 1996).
- The effect works when observers touch, rather than look at, the face (Fowler & Dekle, 1991).
- The effect works less well with vowels than consonants (Summerfield & McGrath, 1984).
- The effect works less well with nonspeech pluck & bow stimuli (Saldaña & Rosenblum, 1994).
- The effect works better with some consonant combinations than others (e.g., McGurk & MacDonald, 1976).
Implications and Applications
The McGurk effect has significant implications for our understanding of multisensory integration and speech perception. It highlights the importance of considering both auditory and visual cues in communication. This knowledge has applications in various fields, including:
- Speech Recognition Technology: Improving speech recognition systems by incorporating visual cues.
- Communication Strategies: Developing effective communication strategies for individuals with hearing impairments.
- Neurological Research: Studying multisensory integration in individuals with neurological disorders.
Critical Points and Controversies
While the scientific basis of the McGurk effect is solid, some critical points and controversies surround the phenomenon. Some researchers argue that the percept need not be a fusion of both modalities, but rather a visual distortion of the auditory signal. There are also methodological considerations when investigating the McGurk effect, such as how its strength is measured and whether the unisensory components are assessed, and these factors should be kept in mind when interpreting results.