Voice Quality Assessment Methods
Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions.
This article explores various methods used for voice quality assessment, including both traditional perceptual evaluations and modern objective techniques.
Perceptual Evaluation of Speech Quality (PESQ)
Perceptual Evaluation of Speech Quality (PESQ) is a family of standards comprising a test methodology for automated assessment of the speech quality as experienced by a user of a telephony system. It was standardized as Recommendation ITU-T P.862 in 2001. PESQ is used for objective voice quality testing by phone manufacturers, network equipment vendors and telecom operators. Its usage requires a license.
ITU-T's family of full reference objective voice quality measurements started in 1997 with Recommendation ITU-T P.861 (PSQM), which was superseded by ITU-T P.862 (PESQ) in 2001. P.862 was later complemented with Recommendations ITU-T P.862.1 (mapping of PESQ scores to a MOS scale), ITU-T P.862.2 (wideband measurements) and ITU-T P.862.3 (application guide). The first edition of ITU-T P.863 (POLQA) entered into force in 2011.
PESQ results principally model mean opinion scores (MOS) that cover a scale from 1 (bad) to 5 (excellent).
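The link between a raw P.862 score and the MOS scale can be made concrete with the P.862.1 mapping mentioned above. The sketch below uses the logistic form and coefficients commonly quoted for the narrowband mapping; treat them as an assumption to be checked against the Recommendation text. The function name `pesq_raw_to_mos_lqo` is only an illustrative choice.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float) -> float:
    """Map a raw narrowband P.862 score to MOS-LQO.

    Coefficients are the values commonly cited for P.862.1 (assumption:
    verify against the Recommendation before relying on them).
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


if __name__ == "__main__":
    # Raw P.862 scores are usually quoted as lying between -0.5 and 4.5.
    for raw in (-0.5, 1.0, 2.5, 4.5):
        print(f"raw {raw:+.1f} -> MOS-LQO {pesq_raw_to_mos_lqo(raw):.2f}")
```

With these coefficients the raw range of -0.5 to 4.5 maps to roughly 1.02 to 4.55, so the mapped score never quite reaches either end of the 1 to 5 MOS scale.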
PESQ was developed to model subjective tests commonly used in telecommunications (e.g., Recommendation ITU-T P.800) to assess the voice quality perceived by human listeners. To characterize listening quality as users perceive it, modern telecom equipment must be tested with speech-like signals: many systems are optimized for speech and respond unpredictably to non-speech signals (e.g., tones, noise).

A "full reference" (FR) algorithm has access to the original reference signal and uses it for a comparison (i.e., a difference analysis). It can compare each sample of the reference signal (talker side) with the corresponding sample of the degraded signal (listener side). A "no reference" (NR) algorithm uses only the degraded signal for the quality estimate and has no information about the original reference signal.
PESQ is a full-reference algorithm: it analyzes the speech signal sample by sample after temporally aligning corresponding excerpts of the reference and test signals.
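The align-then-compare idea behind a full-reference method can be illustrated with a toy example. The sketch below is not PESQ and contains no perceptual model; it only estimates a delay by cross-correlation and then computes a plain signal-to-noise ratio on the aligned samples. The function names and the `max_lag` parameter are hypothetical.

```python
import math


def estimate_delay(reference, degraded, max_lag):
    """Return the lag (in samples) at which `degraded` best matches `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate reference[i] with degraded[i + lag] over the overlapping part.
        n = min(len(reference), len(degraded) - lag)
        score = sum(reference[i] * degraded[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def aligned_snr_db(reference, degraded, max_lag=50):
    """Align the two signals, then compare them sample by sample as an SNR in dB."""
    lag = estimate_delay(reference, degraded, max_lag)
    n = min(len(reference), len(degraded) - lag)
    signal = sum(reference[i] ** 2 for i in range(n))
    noise = sum((reference[i] - degraded[i + lag]) ** 2 for i in range(n))
    if noise == 0:
        return lag, float("inf")
    return lag, 10 * math.log10(signal / noise)


if __name__ == "__main__":
    # Synthetic test signal: a decaying sine, delayed by 7 samples and attenuated.
    ref = [math.sin(0.3 * i) * math.exp(-0.01 * i) for i in range(200)]
    deg = [0.0] * 7 + [0.9 * x for x in ref]
    print(aligned_snr_db(ref, deg))
```

A no-reference method has no `reference` argument to work with, which is why it cannot perform this kind of difference analysis.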
NR algorithms (e.g., Recommendation ITU-T P.563) provide only low-accuracy estimates, because the characteristics of the source signal (e.g., male or female talker, background noise, non-voice content) are completely unknown. A common variant of NR algorithms does not even analyze the decoded audio signal; instead, it analyzes the digital bit stream at the IP packet level.
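As a rough illustration of what packet-level analysis can look at, the sketch below derives a loss ratio purely from the sequence numbers of received packets, without touching the audio. It is not taken from any ITU-T Recommendation; the function name is hypothetical, and it assumes sequence numbers that increase by one per packet and do not wrap around within the measured window.

```python
def packet_loss_ratio(received_sequence_numbers):
    """Estimate the fraction of lost packets from received sequence numbers.

    Assumptions (illustrative only): numbers increase by one per packet,
    do not wrap around, and duplicates count once.
    """
    unique = set(received_sequence_numbers)
    if not unique:
        return 0.0
    expected = max(unique) - min(unique) + 1
    return 1.0 - len(unique) / expected


if __name__ == "__main__":
    # Packets 3 and 6 never arrived; packet 5 arrived out of order.
    print(packet_loss_ratio([1, 2, 5, 4, 7, 8, 9, 10]))
```

A real packet-level estimator would combine statistics like this with codec and jitter information; the loss ratio alone says nothing about the talker or the content, which is part of why such estimates are coarse.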
Voice Quality Assessment Network (VOQANet)
The Voice Quality Assessment Network (VOQANet) is a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, its authors propose VOQANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors, namely jitter, shimmer, and harmonics-to-noise ratio (HNR).
Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), the models are evaluated on both vowel-level and sentence-level speech (PVQD-S) to assess generalizability. Experimental results demonstrate that sentence-based inputs yield higher accuracy, particularly at the patient level.
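Two of the low-level descriptors named above, jitter and shimmer, are cycle-to-cycle perturbation measures. The sketch below uses the common "local" definitions (mean absolute difference between consecutive cycles, divided by the mean) and assumes the glottal periods and peak amplitudes have already been extracted by a pitch tracker. The source does not say how VOQANet+ computes these values, so this is only one plausible formulation, and the function names are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods_s):
    """Cycle-to-cycle variation of the glottal period (periods in seconds)."""
    return relative_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (linear units)."""
    return relative_perturbation(peak_amplitudes)


if __name__ == "__main__":
    # Made-up cycle measurements, for illustration only.
    periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
    amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]
    print(f"jitter  = {100 * local_jitter(periods):.2f} %")
    print(f"shimmer = {100 * local_shimmer(amplitudes):.2f} %")
```

HNR is omitted here because it needs the waveform itself (typically via autocorrelation) rather than per-cycle measurements.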
Overall, VOQANet consistently outperforms baseline models in terms of root mean squared error (RMSE) and Pearson correlation coefficient across CAPE-V and GRBAS dimensions, with VOQANet+ achieving even greater performance gains. Additionally, VOQANet+ maintains consistent performance under noisy conditions, suggesting enhanced robustness for real-world and telehealth applications.
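For readers unfamiliar with the two evaluation metrics, the sketch below shows how RMSE and the Pearson correlation coefficient are computed between clinician ratings and model predictions. The numbers in the usage example are invented for illustration and are not results from the study.

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between reference ratings and predictions."""
    n = len(targets)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical severity ratings on a 0-100 scale, not real data.
    clinician = [12.0, 35.0, 48.0, 60.0, 81.0]
    model = [15.0, 30.0, 50.0, 66.0, 75.0]
    print(f"RMSE    = {rmse(clinician, model):.2f}")
    print(f"Pearson = {pearson(clinician, model):.3f}")
```

Lower RMSE means predictions sit closer to the ratings in absolute terms, while a higher Pearson coefficient means they track the ratings' ordering and spread, so the two metrics can disagree.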

The following table summarizes the key ITU-T recommendations for objective voice quality measurement discussed in the PESQ section:
| Recommendation | Description |
|---|---|
| ITU-T P.862 (PESQ) | Objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. |
| ITU-T P.862.1 | Mapping function for transforming P.862 raw result scores to MOS-LQO. |
| ITU-T P.862.2 | Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. |
| ITU-T P.862.3 | Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2. |
| ITU-T P.863 (POLQA) | Perceptual objective listening quality prediction. |
| ITU-T P.563 | Single-ended method for objective speech quality assessment in narrow-band telephony applications. |