Emotional Prosody: Unveiling Emotions in Speech
Emotional prosody, also known as affective prosody, encompasses the paralinguistic aspects of language use that convey emotion. It includes an individual's tone of voice, conveyed through changes in pitch, loudness, timbre, speech rate, and pauses. This channel communicates the emotions felt by the speaker and gives listeners a better sense of the intended meaning. Its nuances are expressed through intonation, intensity, and rhythm, which combine to form prosody. It can be isolated from semantic information and interacts with verbal content.
Language can be split into two components: the verbal and vocal channels. The verbal channel is the semantic content carried by the speaker's chosen words; it determines the literal meaning of the sentence. The vocal channel is the way a sentence is spoken, which can change its meaning. Usually these channels convey the same emotion, but sometimes they differ.
Decoding emotions in speech involves three stages: determining acoustic features, forming meaningful connections with those features, and processing the acoustic patterns in relation to the connections established. In the processing stage, connections with basic emotional knowledge are stored in a separate memory network specific to associations. These associations can serve as a baseline for emotional expressions encountered in the future. On average, listeners perceive intended emotions at a rate significantly better than chance (chance being approximately 10%), though error rates remain high.
Emotional prosody in speech is decoded slightly less accurately than facial expressions, and accuracy varies across emotions. Emotional states such as happiness, sadness, anger, and disgust can be determined solely from the acoustic structure of a non-linguistic speech act, such as a grunt, sigh, or exclamation. In addition, it has been shown that emotion can be expressed in non-linguistic vocalizations differently than in speech. As Laukka et al. note, speech requires highly precise and coordinated movement of the articulators (e.g., lips, tongue, and larynx) in order to transmit linguistic information, whereas non-linguistic vocalizations are not constrained by linguistic codes and thus do not require such precise articulation. In their study, actors were instructed to vocalize an array of different emotions without words, and listeners could identify a wide range of positive and negative emotions above chance.
In a 2015 study by Verena Kersken, Klaus Zuberbühler and Juan-Carlos Gomez, non-linguistic vocalizations of infants were presented to adults to see whether the adults could distinguish between infant vocalizations indicating requests for help, pointing to an object, or referring to an event. Infants show different prosodic elements in crying depending on what they are crying for, and they also produce differing outbursts for positive and negative emotional states.
Most research on the vocal expression of emotion has used synthetic speech or portrayals of emotion by professional actors; little has been done with spontaneous, "natural" speech samples. Artificial speech samples are considered close to natural speech, but portrayals by actors in particular may be influenced by stereotypes of emotional vocal expression and may exhibit intensified speech characteristics that skew listeners' perceptions. Another consideration lies in listeners' individual perceptions: studies typically average responses, but few examine individual differences in depth.
Acoustic Features of Specific Emotions
Specific emotions manifest unique acoustic profiles in speech:
- Anger: Anger can be divided into two types: "anger" and "hot anger". Compared with neutral speech, anger is produced with a lower pitch, higher intensity, more energy (500 Hz) across the vocalization, a higher first formant (first sound produced), and faster attack times at voice onset (the start of speech).
- Disgust: Compared with neutral speech, disgust is produced with a lower, downward-directed pitch, with energy (500 Hz), a lower first formant, and fast attack times similar to anger.
- Fear: Fear can be divided into two types: "panic" and "anxiety".
Here is a table summarizing the acoustic features of different emotions:
| Emotion | Acoustic Features |
|---|---|
| Anger | Lower pitch, higher intensity, more energy, higher first formant, faster attack times |
| Disgust | Lower, downward-directed pitch, energy (500 Hz), lower first formant, fast attack times |
| Fear | Two subtypes: "panic" and "anxiety" |
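The features summarized above can be estimated computationally from a waveform. As a minimal, hypothetical sketch (the function names and the synthetic test tone are illustrative, not drawn from any study cited here), the following Python snippet estimates fundamental frequency (pitch) from an FFT peak and intensity as RMS energy:

```python
import numpy as np

def estimate_pitch_fft(signal, sample_rate):
    """Estimate fundamental frequency (Hz) as the strongest FFT peak."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    peak = np.argmax(spectrum[1:]) + 1  # skip the DC component
    return freqs[peak]

def rms_intensity_db(signal):
    """Root-mean-square intensity in decibels relative to full scale."""
    rms = np.sqrt(np.mean(signal ** 2))
    return 20 * np.log10(rms)

# Synthesize a 200 Hz tone standing in for a voiced speech segment.
sr = 16000
t = np.arange(0, 0.5, 1.0 / sr)
tone = 0.5 * np.sin(2 * np.pi * 200 * t)

print(estimate_pitch_fft(tone, sr))  # ≈ 200 Hz
print(rms_intensity_db(tone))        # ≈ -9.0 dBFS
```

Real speech analysis would use a robust pitch tracker rather than a raw FFT peak, but the sketch shows which physical quantities the table's labels refer to.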
Prosody describes the rate, rhythm, and melody of our speech. We change these features when we speak to convey meaning beyond the words themselves, including humor, sarcasm, and emotional tone. For example, the word “fine” can convey a range of emotions and meanings depending on the tone, or prosody, the speaker uses. Prosody can also convey information about the speaker’s intent, such as when asking a question or placing stress on a word. For instance, the question “What are you doing here?” asks for very different information depending on the stress or emphasis used (i.e., “What are you doing here?” vs. “What are you doing here?”).
Emotional prosody is the study of how emotions are conveyed and perceived through the acoustic qualities of speech. In linguistics and the study of human behavior, it refers to the modulation of acoustic elements in speech, such as pitch, rhythm, intensity, and duration, to express emotions. This subtle yet powerful phenomenon works alongside the semantic content of speech, adding an emotionally rich layer to verbal expression.

Pitch is arguably the most prominent feature of emotional prosody and plays a key role in conveying emotional states. For example, a wider pitch range and higher average pitch are often associated with positive emotions like joy or excitement. The rhythmic aspects of speech, including tempo and timing, are also key components. Speaking at an accelerated tempo often conveys excitement, urgency, or enthusiasm; for example, a salesperson might speak quickly to create a sense of urgency in a pitch.
Pauses, both silent and vocalized, are powerful tools for emotional expression. For instance, an orator might pause before delivering a climactic line in a speech to heighten its impact. Pauses are not just about emotion; they also reveal cognitive processes, especially during deceptive communication. In emotional communication, loudness plays a more prominent role than pitch in influencing perceptions of truthfulness. Research, such as that by Benus et al. (2006), advocates for analyzing speech holistically by combining prosodic elements like pauses and loudness with additional lexical and speaker-specific features.
Intensity and duration play key roles in shaping emotions in speech. Intensity, which relates to how loud or soft speech is, helps express emotions, while duration refers to how long sounds, pauses, and speech patterns last. Long vowels, extended pauses, or fast speech all add emotional meaning.

Culture and personal traits both shape how emotions are expressed through speech. Different cultures have unique ways of using tone and rhythm to show feelings, and personal traits add another layer of variety: personality, thinking styles, and life experiences can change how someone expresses or understands emotions. For example, an outgoing person might use more dramatic tones, while a quieter person might express emotions gently. These differences show how both culture and individuality influence emotional communication.
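The role of intensity and pauses can be made concrete with a simple measurement. The sketch below (a hypothetical illustration; the frame length and silence threshold are assumptions, not values from any cited study) segments a signal into speech and pause frames by thresholding frame-level RMS energy:

```python
import numpy as np

def frame_rms(signal, frame_len):
    """RMS energy of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def pause_fraction(signal, sample_rate, frame_ms=25, threshold=0.01):
    """Fraction of frames whose RMS falls below a silence threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    energy = frame_rms(signal, frame_len)
    return float(np.mean(energy < threshold))

# Toy example: 0.5 s of "speech" (a tone) followed by 0.5 s of silence.
sr = 8000
t = np.arange(0, 0.5, 1.0 / sr)
speech = 0.3 * np.sin(2 * np.pi * 150 * t)
utterance = np.concatenate([speech, np.zeros_like(speech)])

print(pause_fraction(utterance, sr))  # → 0.5
```

Measures like this (pause fraction, mean intensity, speaking-segment durations) are the kind of low-level descriptors emotion-from-speech studies compute before relating them to listener judgments.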
Neurological Aspects of Emotional Prosody
The neurological processes integrating verbal and vocal (prosodic) components are relatively unclear. However, it is assumed that verbal and vocal content are processed in different hemispheres of the brain. Verbal content, composed of syntactic and semantic information, is processed in the left hemisphere: syntactic information primarily in the frontal regions and a small part of the temporal lobe, and semantic information primarily in the temporal regions with a smaller contribution from the frontal lobes. In contrast, prosody is processed primarily along the same pathway, but in the right hemisphere. Neuroimaging studies using functional magnetic resonance imaging (fMRI) provide further support for this hemispheric lateralization and temporo-frontal activation. Some studies, however, show evidence that prosody perception is not exclusively lateralized to the right hemisphere and may be more bilateral.
Deficits in expressing and understanding prosody, caused by right hemisphere lesions, are known as aprosodias. These can manifest in different forms and in various mental illnesses or diseases; aprosodia can also be caused by stroke and alcohol abuse. Right hemisphere damage (RHD) often causes difficulty producing or understanding prosody, especially emotional prosody. Aprosodia often results in "flat"-sounding or monotone speech, typically coupled with minimal change in facial expressions and body language, making it hard to read that person's emotions or intentions (was he joking or being serious?). The person may also have difficulty understanding others' use of prosody and body language. This can cause misunderstandings and make it appear that the person with RHD is being insensitive to their partners' emotions and subtle meaning. Prosodic features are often paired with body language or facial expressions to help us send our intended message.
Because the right hemisphere of the brain is associated with prosody, patients with right hemisphere lesions have difficulty varying speech patterns to convey emotion, and their speech may therefore sound monotonous. Difficulty decoding both syntactic and affective prosody is also found in people with autism spectrum disorder and schizophrenia, where "patients have deficits in a large number of functional domains, including social skills and social cognition. These social impairments consist of difficulties in perceiving, understanding, anticipating and reacting to social cues that are crucial for normal social interaction." This has been observed in multiple studies, such as Hoekert et al.'s 2017 study on emotional prosody in schizophrenia, which concluded that more research is needed to fully confirm the relationship between the illness and emotional prosody.
Recognizing vocal expressions of emotion becomes increasingly difficult with age. Older adults have slightly more difficulty than young adults in labeling vocal expressions of emotion, particularly sadness and anger, but much greater difficulty integrating vocal emotions with corresponding facial expressions. A possible explanation is that combining two sources of emotion requires greater activation of the brain's emotion areas, in which older adults show decreased volume and activity. Another possible explanation is that hearing loss could lead to mishearing vocal expressions.
Men and women differ both in how they use language and in how they understand it. Differences are known in the rate of speech, the range of pitch, the duration of speech, and pitch slope (Fitzsimmons et al.). For example, "In a study of relationship of spectral and prosodic signs, it was established that the dependence of pitch and duration differed in men and women uttering the sentences in affirmative and inquisitive intonation. Tempo of speech, pitch range, and pitch steepness differ between the genders" (Nesic et al.). Women and men also differ in how they neurologically process emotional prosody. In an fMRI study, men showed stronger activation in more cortical areas than female subjects when processing the meaning or manner of an emotional phrase. In the manner task, men had more activation in the bilateral middle temporal gyri, while for women the only area of significance was the right posterior cerebellar lobe. Male subjects in this study also showed stronger activation in the prefrontal cortex and on average needed a longer response time than female subjects. This result was interpreted to mean that men need to make conscious inferences about the acts and intentions of the speaker, while women may do this subconsciously.
Applications of Emotional Prosody
Understanding emotional prosody goes far beyond theoretical interest, playing a significant role in numerous practical fields such as human-computer interaction, artificial intelligence, and clinical psychology. For instance, the ability to recognize and replicate emotional prosody in speech synthesis systems has been instrumental in creating virtual assistants that feel more natural and emotionally engaging. In clinical settings, the analysis of emotional prosody offers valuable insights into psychological health. It has proven useful in diagnosing and treating a range of psychological disorders, such as autism spectrum disorder and mood disorders, including depression. For individuals with autism, prosody analysis can highlight challenges in emotional expression, providing therapists with actionable data to guide interventions. Additionally, emotional prosody research has implications for education and social training: programs designed to enhance emotional literacy and communication skills can use prosody insights to teach individuals how to better express and interpret emotions through speech.
The integration of voice analysis into the iMotions Software suite has opened opportunities for emotion research: by analyzing the emotional dimensions of speech, researchers can uncover hidden layers of communication in respondent testimonies, adding depth and nuance to their studies. Emotional prosody serves as a crucial link between people, as well as a bridge between linguistics and psychology, revealing the intricate ways our voices encode and decode emotions.