Translation Quality Assessment Methods
Translation quality assessment is crucial for ensuring accurate and effective communication across languages. It involves a multifaceted approach that combines traditional and modern metrics.
With the advent of machine translation and neural network-based models, the metrics for evaluating translation quality have evolved significantly. For enterprises operating on a global scale, the disconnect between traditional automated scores and the actual, perceived quality of a translation can have significant consequences.
While the goal is clear (flawless communication), the methods for measuring it have been a subject of intense debate and innovation. For years, the translation industry has relied on a set of automated metrics to provide a fast, scalable way to benchmark machine translation (MT) systems.
Automated Metrics
Automated metrics provide a scalable and consistent approach to translation quality assessment. These metrics use algorithms to compare translations against reference texts, providing objective scores that can be used for benchmarking and comparison.
While traditional metrics like BLEU, METEOR, and TER provide a foundational framework, modern metrics like COMET, BERTScore, and GEMBA offer advanced capabilities through deep learning and semantic analysis.
BLEU (Bilingual Evaluation Understudy)
The BLEU score is one of the most widely used metrics for evaluating machine translation quality. It measures the correspondence between a machine’s output and reference translations done by humans. The score ranges from 0 to 1, with higher scores indicating better quality. BLEU focuses on precision by comparing n-grams (contiguous sequences of n items from a given sample of text) between the translated and reference texts.
In simple terms, BLEU compares a machine-generated text to one or more human reference translations, counting the overlapping words and phrases to generate a score: the more overlap, the higher the score. While n-gram metrics of this kind served a purpose in the early days of MT, their limitations have become increasingly apparent. Their core flaw is an inability to understand semantics, context, or style.
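To make the mechanics concrete, here is a minimal, self-contained sketch of sentence-level BLEU: clipped n-gram precision combined via a geometric mean, plus a brevity penalty. Production implementations such as sacreBLEU add smoothing, corpus-level aggregation, and support for multiple references, all omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped counts: a candidate n-gram only scores up to the
        # number of times it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

Note how `bleu("the cat sat on a mat", "the cat sat on the mat")` lands strictly between 0 and 1: the unigram overlap is high, but longer n-grams spanning the substituted word fail to match.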
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
METEOR was developed to address some of BLEU’s shortcomings. It considers precision and recall, making it more sensitive to synonymy and stemming. METEOR aligns translations with reference texts using stemming and synonyms, thus providing a more linguistically informed evaluation.
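The scoring core of METEOR can be sketched in a few lines. This simplified version matches exact unigrams only, leaving out the stemming, synonym matching, and fragmentation penalty of the full metric; what it does show is METEOR's recall-weighted harmonic mean, which is what distinguishes it from BLEU's precision-only view.

```python
from collections import Counter

def meteor_fmean(candidate, reference):
    """Recall-weighted harmonic mean at the heart of METEOR:
    Fmean = 10*P*R / (R + 9*P). Exact unigram matches only."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum(min(c, ref[w]) for w, c in cand.items())
    if matches == 0:
        return 0.0
    p = matches / sum(cand.values())  # precision
    r = matches / sum(ref.values())   # recall
    return 10 * p * r / (r + 9 * p)
```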
TER (Translation Edit Rate)
TER measures the number of edits required to change a machine-translated output into one of the reference translations. These edits can include insertions, deletions, substitutions, and shifts. TER is expressed as the number of edits divided by the average length of the reference translations.
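A minimal sketch of the edit-distance core of TER, assuming a single reference and omitting the block shifts that the full metric also counts:

```python
def ter(candidate, reference):
    """Simplified Translation Edit Rate: word-level edit distance
    (insertions, deletions, substitutions) divided by reference length.
    Real TER additionally allows shifts of word blocks."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming edit distance over words.
    prev = list(range(len(ref) + 1))
    for i, cw in enumerate(cand, 1):
        curr = [i]
        for j, rw in enumerate(ref, 1):
            cost = 0 if cw == rw else 1
            curr.append(min(
                prev[j] + 1,         # drop a candidate word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # substitute (or match when equal)
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

Unlike BLEU, lower is better: a TER of 0 means no edits were needed.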
COMET
COMET is a recent addition to the suite of metrics for translation quality assessment. It leverages pre-trained language models and fine-tunes them for specific translation tasks. COMET evaluates translations based on their semantic similarity to reference texts, providing a more nuanced assessment of meaning and context.
BERTScore
BERTScore evaluates translations by computing token-level similarity scores using contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers). This metric captures semantic meaning and context, offering a robust evaluation framework for translation quality.
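The matching arithmetic behind BERTScore can be shown with toy vectors. In the real metric, the vectors are contextual embeddings produced by a BERT model (e.g., via the bert-score package); here they are hard-coded 2-d stand-ins, purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bertscore_f1(cand_vecs, ref_vecs):
    """Core of BERTScore: each candidate token is greedily matched to its
    most similar reference token (precision) and vice versa (recall),
    then the two are combined into an F1."""
    p = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    r = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * p * r / (p + r)

# Toy 2-d "token embeddings" (illustrative stand-ins only).
cand = [(1.0, 0.0), (0.6, 0.8)]
ref = [(1.0, 0.0), (0.0, 1.0)]
```

Because matching is done in embedding space rather than on surface strings, a synonym with a nearby vector still scores highly, which is exactly what n-gram metrics cannot do.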
GEMBA
GEMBA (GPT Estimation Metric Based Assessment) takes a different approach from string-matching metrics: it prompts a large language model (LLM) to score a translation directly, with or without a reference translation, allowing it to judge both surface-level and semantic quality.
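The essence of GEMBA is a prompt rather than a formula. A minimal sketch of a GEMBA-style direct-assessment prompt follows; the wording is a paraphrase rather than the exact published template, and the actual call to an LLM is omitted.

```python
def gemba_prompt(source, translation, source_lang="English", target_lang="German"):
    """Builds a GEMBA-style direct-assessment prompt asking an LLM to
    score a translation on a 0-100 scale. The returned string would be
    sent to a large language model; that call is out of scope here."""
    return (
        f"Score the following translation from {source_lang} to "
        f"{target_lang} on a scale from 0 to 100, where 0 means no "
        f"meaning preserved and 100 means a perfect translation.\n\n"
        f"{source_lang} source: {source}\n"
        f"{target_lang} translation: {translation}\n"
        f"Score:"
    )
```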
Human Evaluation
Human evaluation remains a critical component of translation quality assessment. Experienced translators and linguists review translations to identify errors, assess fluency, and ensure accuracy. Human evaluators use predefined criteria and guidelines to provide objective and consistent evaluations.
Given the shortcomings of automated scores, human evaluation remains the gold standard for assessing translation quality. Professional linguists can discern the subtle nuances that machines often miss, assessing tone, cultural appropriateness, style, and brand voice. They can determine whether a translation is not just technically correct but also engaging and persuasive. However, human evaluation comes with its own trade-offs.
It is time-consuming and can be expensive to scale, making it challenging to implement across the vast volumes of content that global enterprises produce. This creates a core conflict for any business looking to expand internationally: How do you achieve the deep, nuanced quality of human assessment with the speed, scale, and cost-efficiency that automation promises?
Incorporating both automated and human evaluations ensures a comprehensive assessment that addresses accuracy, fluency, consistency, and cultural appropriateness.
Evaluation Criteria
- Accuracy: Accuracy refers to how well the translated text reflects the meaning of the source text. This involves correct translation of terms, phrases, and sentences without adding, omitting, or altering the original meaning.
- Fluency: Fluency assesses the readability and grammatical correctness of the translated text in the target language. A fluent translation should read naturally, without awkward phrasing or grammatical errors.
- Consistency: Consistency in translation involves the uniform use of terminology and style throughout the text. This is particularly important for technical or specialized translations where specific terms must be used consistently to avoid confusion.
- Cultural Appropriateness: Cultural appropriateness evaluates whether the translation respects and adapts to the cultural norms and sensibilities of the target audience. This involves localizing idioms, expressions, and references to make the translation relevant and acceptable to the target culture.
Methods of Human Evaluation
- Direct assessment: Evaluators provide an overall score based on their subjective judgment of the translation’s quality.
- Error annotation: Evaluators identify and categorize specific errors in the translation.
- Adequacy/fluency rating: Evaluators rate the translation on two separate scales: adequacy (how well the translation conveys the meaning of the source text) and fluency (how well the translation reads in the target language).
Back-Translation (BT)
Back-translation (BT) is a commonly used quality assessment tool in cross-cultural research. This quality assurance technique consists of (a) translation (target text [TT1]) of the source text (ST), (b) translation (TT2) of TT1 back into the source language, and (c) comparison of TT2 with ST to make sure there are no discrepancies. The accuracy of the BT with respect to the source is supposed to reflect equivalence/accuracy of the TT.
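The three steps can be sketched as a simple pipeline. The toy word-for-word "translators" below are purely illustrative stand-ins for human translators or MT systems. Note that precisely because they are literal, they back-translate perfectly, which foreshadows the weakness Brislin himself warned about.

```python
def back_translation_check(source, translate, back_translate):
    """The three BT steps: (a) translate the source text (ST) into TT1,
    (b) translate TT1 back into the source language as TT2,
    (c) compare TT2 with ST for discrepancies."""
    tt1 = translate(source)
    tt2 = back_translate(tt1)
    return {"TT1": tt1, "TT2": tt2, "matches_source": tt2 == source}

# Toy word-for-word English<->Spanish "translators" (illustration only).
en_es = {"the": "el", "cat": "gato", "sleeps": "duerme"}
es_en = {v: k for k, v in en_es.items()}
translate = lambda s: " ".join(en_es.get(w, w) for w in s.split())
back = lambda s: " ".join(es_en.get(w, w) for w in s.split())

result = back_translation_check("the cat sleeps", translate, back)
```

A perfect round trip here tells us nothing about whether "el gato duerme" reads naturally to a target-language speaker; it only tells us the mapping was reversible.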
BT as a quality assessment tool in cross-cultural research can be said to originate in the work of Brislin (1970, 1986). Although in the original 1970 publication BT was part of a larger and more complex proposal on evaluation, it is the BT aspect that has endured, used extensively as the primary method of quality evaluation in some cases and, in others, in combination with additional quality control methodologies such as pretesting, posttesting, and committee consultation.
BT is considered “by far the most popular quality assessment tool used in international and cross-cultural social research” (Tyupa, 2011, p. 36). Tyupa reports that in international nursing research, a survey by Maneesriwongul and Dixon (2004, p. 177) showed that of the 47 studies devoted to instrument translation from English into other languages (Chinese, Spanish, Korean, and Finnish), 38 used BT and 13 of these resorted exclusively to BT (no testing).
In addition to its presence in social studies, BT figures prominently in health care, specifically in quality-of-life research. Because of the growing interest in the “international” side of health-related quality-of-life (HRQOL) questionnaires and other health-related surveys, researchers are faced with the need to translate existing measures.
Although in early studies BT was the “recommended technique” for quality control (Berkanovic, 1980, p. 1273), in more recent research, translation and cross-cultural adaptation of questionnaires generally include BT as part of the quality control methodology, motivated by the desire to attain equivalence with existing validated English questionnaires and to make sure that differences are due to patients rather than to the questionnaires (Bonomi et al., 1996; Cella et al., 1998; Eremenco, Cella, & Arnold, 2005; Guillemin, Bombardier, & Beaton, 1993; Herdman, Fox-Rushby, & Badia, 1997).
Eremenco, Cella, and Arnold (2005, p. 217), for instance, describe a translation methodology in which two independent translators produce two translations (TT1 and TT2) of a source text (ST); TT1 and TT2 are reconciled by another translator in a third translation (TT3). In health fields such as public health and audiology, BT is also a component of translation methodologies and transcultural adaptation of questionnaires, along the lines of HRQOL research (Cardemil et al., 2013; Lichtenstein & Hazuda, 1998).
Cardemil et al. (2013, p. 418), for instance, report a transcultural adaptation process in which the Effectiveness of Auditory Rehabilitation Quality-of-Life Evaluation Scale was translated by two bilingual professionals in the areas of otolaryngology and audiology prior to the generation of a consensus version in Spanish. This scale was then back-translated into English by a third bilingual professional and compared to the original version by the author of the original scale.
Lichtenstein and Hazuda (1998) describe a process in which the Hearing Handicap Inventory for the Elderly (HHIE) was translated independently by “three individuals fluent in English and the Spanish spoken among Mexican Americans in South Texas.” These were followed by three BTs that were reviewed by an independent reviewer. In sum, a sample of the literature on the evaluation of cross-cultural instruments demonstrates the prominent role played by BT, used either on its own or in combination with other methods of evaluation (cf. Table 1).
Bearing in mind that BT is well established as a quality control mechanism in cross-cultural adaptation of questionnaires and in cross-cultural research in general, it is critical to ascertain its methodological soundness, in particular, by bringing in the interdisciplinary perspective and current knowledge in pertinent fields, such as translation studies.
The rest of this article presents the problems faced by BT from a conceptual and theoretical point of view (whether used on its own or in conjunction with committee review or other methods), on the basis of current knowledge in translation studies, and accompanied by illustrative examples (cf. Table 2). It also introduces recent critique and empirical evidence against BT to support the view that BT should be entirely abandoned as a method of translation quality evaluation (cf. Table 2). We propose an alternative functionalist approach based on a case study.
Additionally, we argue that any approach to the development of multilingual research instruments will have to consider the specifics of the concrete project and the purpose and audience for which the cross-cultural instruments are adapted/developed. Most instruments will still require separate validation; yet, an approach based on current knowledge on the nature of translation and language mediation has a better chance of obtaining good validation results as well as valid useful findings.
Challenges of Back-Translation
Despite its commonly accepted use as a standard method of translation quality assessment, BT can present serious challenges of which researchers need to be cognizant. Brislin, in his seminal article about BT in cross-cultural research, warns that a BT may suggest that a translation is equivalent to its ST, since several factors besides good translation can create seeming equivalence between source, target, and back-translated versions (1970, p. 186).
“The bilingual translating from the source language to the target may retain many of the grammatical forms of the source. This version would be easy to back-translate, but worthless for the purpose of asking questions of target-language monolinguals since its grammar is that of the source, not the target.” Brislin is referring here to well-known issues in translation studies: (i) equivalence, and in particular perfect equivalence, is an unattainable goal; (ii) BT introduces an additional layer of translation/transformation, which may bring the target closer to or further from the source; and (iii) the more literal the translation, the closer the BT will be to the source.
Note that the “Brislin method” is not the same as BT and has broader value. While BT has its origins in Brislin’s proposal, there is much more to that proposal than BT, including the recommendation to use BT critically and in conjunction with other methods. Generally, what has been adopted from Brislin’s work is simply the uncritical use of BT, and this absence of critical assessment explains in part why BT is still in use.
Tyupa (2011, p. 36) identifies the linguistic theory behind BT as its major flaw. Brislin (1970), based on Nida (1964), sees equivalence of meaning as the goal of the translation process, more specifically the creation of an equivalent response in the reader. For Tyupa, the problem with that framework is “that meaning was viewed from an objectivist position, the approach adopted in the mainstream linguistics of the day. This contrasts with the view adopted in cognitive linguistics, which treats meaning as inherently dynamic and equates it with conceptualization” (2011, p. 37).
In other words, BT is based on the view of meaning prevalent at the time Brislin published his work. Since then, objectivist theories of meaning have been shown by research in linguistics and psychology to be misguided. As Tyupa states, “there is no inherent, or objective meaning in the original questionnaires; meanings arise through the process of conceptualization, be it that of the developers, translators, or reviewers” (2011).
In addition to reflecting an objectivist view of meaning, BT relies on an understanding of translation that goes back to the 1970s, which is not surprising given the influential role of Brislin’s proposal from that time. In consonance with objectivist views of meaning and with structural linguistics, the ultimate purpose of translation was to create a TT that was equivalent to the ST (Catford, 1965; Nida, 1964); therefore, if at some point equivalence was not attained, BT should be able to identify the errors by highlighting points of difference with the ST.
However, research in translation studies, starting at least in the 1980s, has sufficiently shown that this was a restricted and partial notion of translation that focused almost exclusively on the printed text as an object (seen as a sequence of sentences) and on the linguistic structure of those sentences. Essential elements in translation and in text production and reception, such as audience, purpose, and the social conditions of the text or the translation, were rarely, if ever, considered.
Another important element behind BT is the notion of equivalence. Equivalence is the underlying principle guiding the translation and cultural adaptation of instruments first developed and validated in English, so that they can be used in international contexts without undergoing an entirely new development and validation process. In these fields, BT is one way to make sure that the TT is equivalent to the ST.
Much of the literature on cross-cultural research refers simply to equivalence in general, while only a few studies break it down into subtypes. Of these, semantic equivalence is probably the most common, defined generally as equivalence in the meaning of words (Guillemin, Bombardier, & Beaton, 1993, p. 1423; Lichtenstein & Hazuda, 1998). A few studies mention other subtypes, such as cultural equivalence (similar meaning and relevance of the constructs examined across cultures) and functional equivalence (the degree to which a concept performs the same way or elicits similar responses; Jones, Lee, Phillips, Zhang Xinwei, & Jaceldo, 2001, p. 300), with very few referring to elements beyond the word or phrase (e.g., Guillemin’s experiential equivalence, i.e., the situations evoked should fit the target context).
In other words, most notions of equivalence are word based or concept based, as they were at the time Brislin published his influential work. Similarly, this view of equivalence does not consider the reader, the context of the translation, or the text as a unit (Colina, 2015).
In a review of the definitions of the different types of equivalence discussed in the HRQOL literature, Herdman, Fox-Rushby, and Badia (1997) found that there is a “distinct lack of clarity and a considerable amount of confusion surrounding the way in which various types of equivalence are defined within the HRQOL field” (p. 243). In other words, researchers in this field are coming to the realization of what is a mainstay in translation studies: equivalence is a controversial, vague, and hard to define term.
The use of BT as a quality control method for translation in cross-cultural research highlights a lay view of translation, based on the notions prevalent in the 1970s, such as meaning as an objective reality and equivalence (at the conceptual or semantic level) as the ultimate measure of translation quality. These notions remain mostly unchallenged in cross-cultural research (with a few exceptions, cf. below) even into the first decade of the 21st century.
Among studies in cross-cultural research (e.g., marketing, nursing, quality of life, audiology, etc.), it is difficult to find any references to recent work in translation studies aside from the original work by Brislin and references therein. Even a more recent publication containing a separate section on translation theory only included references to Brislin (1970; e.g., Jones et al., 2001). A notable exception, however, is Fourie and Feinauer (2005) who conducted an updated and comprehensive review of relevant work in translation studies in the context of medicine and health.
While BT can be useful to spot errors in the translation of purely referential meaning and one-to-one meaning correspondences, such as specialized technical terminology in highly specialized objective texts (e.g., names of chemical compounds in a chemistry paper), for other types of text and language it can actually work in the opposite way from what was intended: as Brislin noted in the quote above, a literal translation that preserves the grammar of the source will back-translate cleanly while remaining unusable in the target language.
Time to Edit (TTE): A Modern Metric
To solve this challenge, the industry is moving toward more sophisticated, human-centric metrics. At Translated, we have pioneered the use of Time to Edit (TTE), a groundbreaking metric that redefines quality assessment. TTE measures the time a professional translator takes to edit a machine-translated segment to make it perfect. It is a direct, empirical measure of the friction between the AI’s output and human standards of excellence.
- It measures real-world effort: Unlike abstract scores, TTE quantifies the actual work required to achieve a flawless translation.
- It embodies the Human-AI symbiosis: TTE is the ultimate expression of our collaborative philosophy.
- It aligns with business goals: For any enterprise, time is money.
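As a back-of-the-envelope illustration, TTE can be aggregated as editing seconds per word across post-edited segments. The segment format and the per-word normalization below are assumptions for illustration, not a description of Translated's internal pipeline.

```python
def tte_per_word(segments):
    """Aggregate Time to Edit across segments: total editing seconds
    divided by total words, giving seconds per word so that segments of
    different lengths are comparable. Each segment is a hypothetical
    (edit_seconds, word_count) pair."""
    total_time = sum(t for t, _ in segments)
    total_words = sum(w for _, w in segments)
    return total_time / total_words

# Hypothetical post-editing log: (seconds spent editing, words in segment).
log = [(12.0, 10), (3.0, 8), (0.0, 6)]
avg = tte_per_word(log)  # lower means less human effort per word
```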
This innovative approach is powered by our core Language AI Solutions. While we innovate, we also respect the established frameworks that have guided the industry. Standards like ISO 17100 have been crucial in defining the requirements for a quality translation process, emphasizing the need for qualified professionals and rigorous review workflows. We see our methodology not as a replacement for these standards, but as the next evolution.
Translated’s TTE-based approach offers a dynamic, real-time benchmark that goes beyond static process requirements. It provides a continuous measure of quality that adapts and improves with every project. This data-driven model allows us to track our progress toward what we call the “singularity” in translation: the point at which machine translation becomes indistinguishable from human translation.
Achieving this level of quality requires a tightly integrated ecosystem of technology and talent. Our TranslationOS serves as the central platform for this entire process. It is where workflows are managed, quality is measured in real-time, and performance data is captured. This creates a powerful feedback loop that drives continuous improvement.
Our Professional Translation Agency is a crucial part of this quality engine. Our global network of expert linguists provides the essential human touch, performing the final edits that ensure perfection. Their work does more than just finalize a project; it generates the high-quality data that trains our Language AI to become even more accurate and context-aware.
Best Practices for Translation Quality Assessment
- Use a mix of automated metrics and human evaluations for a balanced approach to translation quality assessment.
- Regularly monitor translation quality and provide feedback to translators to maintain high standards.
- Invest in the training and development of translators so they are equipped with the skills and knowledge to produce high-quality translations.
- Customize evaluation metrics to suit specific translation projects to enhance the relevance and accuracy of the assessment.
Specific Translation Types
- Accuracy and consistency are paramount in technical translations.
- Fluency and cultural appropriateness are crucial for literary translations.
- Marketing translations require cultural appropriateness and creativity.
The science of measuring translation quality has evolved far beyond simplistic, automated scores. It has become a sophisticated, data-driven discipline that places human expertise at its very center. For enterprises that cannot afford to compromise on quality, legacy metrics like BLEU are no longer sufficient. The new standard is a dynamic, transparent, and measurable approach that reflects real-world efficiency and impact. Metrics like Time to Edit (TTE), powered by a purpose-built Language AI and managed within an integrated TranslationOS, offer the only reliable path to achieving consistent, high-impact global communication at scale.