Translation Performance Metrics


Reference similarity metrics: These metrics measure the similarity between reference translations and machine translation output; the more similar the output is to the reference, the higher the score. They range from metrics that compare surface forms only to metrics that compare semantic similarity. The most reliable measure among them is BERTScore.


1. Bilingual Evaluation Understudy Score (BLEU)

Summary: Computes similarity to reference translation based on words - higher is better - less reliable


Description: One of the most popular and cheapest automatic metrics for Machine Translation evaluation. It computes the similarity of a generated translation to a reference translation (or translations) based on word n-gram overlap, where higher BLEU values correspond to more matching n-grams. Though BLEU is simple to compute, it has several drawbacks, namely: 

  • It does not take semantic similarity into account (e.g., exchanging "imported" for "foreign" lowers the score as much as exchanging any random word for "foreign").
  • It does not give partial credit to near matches or morphological variants, such as exchanging "car" for "cars".
  • It may miss important word swaps that alter meaning (e.g., "A causes B" receives a similar score to "B causes A").

Some of these drawbacks are alleviated by using multiple reference translations. However, additional reference translations are expensive to prepare and are often not available.


Range of values: 0% - 100%


Direction of improvement: Higher is better.

  • < 10 Almost useless
  • 10 - 19 Hard to get the gist
  • 20 - 29 The gist is clear, but has significant grammatical errors
  • 30 - 40 Understandable to good translations
  • 40 - 50 High-quality translations
  • 50 - 60 Very high-quality, adequate, and fluent translations
  • > 60 Quality is often better than human
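
As a practical illustration (a minimal sketch, not part of the metric definition above), corpus-level BLEU can be computed with the open-source sacrebleu library; the example sentences below are made up.

```python
# A minimal sketch of computing BLEU with the sacrebleu library.
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
# One reference stream: references[0][i] is the reference for hypotheses[i].
references = [["the cat is sitting on the mat", "a dog is in the garden"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # reported on a 0-100 scale; higher is better
```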


2. Character n-gram F-score (chrF)

Summary: Computes similarity to reference translation based on character sequences - higher is better - less reliable.


Description: Similar to BLEU, except that it works at the level of character sequences instead of word tokens. Though this alleviates the problem of near matches and morphological variants, it does not handle changes in meaning due to word reordering and still penalizes semantically equivalent word choices.


Range of values: 0-1 or 0% - 100%


Direction of improvement: Higher is better.
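
For illustration, chrF can also be computed with sacrebleu, using the same data layout as the BLEU sketch above; the sentences are again made-up examples.

```python
# A minimal sketch of computing chrF with sacrebleu.
import sacrebleu

hypotheses = ["the cats sat on the mat"]
references = [["the cat sat on the mat"]]  # one reference stream

chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"chrF: {chrf.score:.2f}")  # 0-100 in sacrebleu; "cats" vs "cat" still earns partial credit
```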


3. Metric for Evaluation of Translation with Explicit ORdering (METEOR)

Summary: Computes similarity to reference translation based on words, their stems, and synonyms - higher is better - less reliable.


Description: It is a traditional MT metric that attempts to overcome the shortcomings of BLEU by measuring the similarity between a machine translation and a gold-standard reference, taking stems and synonyms into account.

Range of values: 0% - 100%


Direction of improvement: Higher is better.
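
A minimal sketch using NLTK's implementation is shown below; note that recent NLTK versions expect pre-tokenized input and need the WordNet data (and possibly related corpora) downloaded for synonym matching. The sentences are illustrative only.

```python
# A minimal sketch of METEOR with NLTK.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

# Recent NLTK versions expect pre-tokenized references and hypotheses.
reference = "the imported goods were expensive".split()
hypothesis = "the foreign goods were expensive".split()

result = meteor_score([reference], hypothesis)
print(f"METEOR: {result:.3f}")  # 0-1; stems and synonyms receive partial credit
```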


4. BERTScore

Summary: Computes similarity to reference translation based on latent semantic representation - higher is better - moderately reliable.


Description: It expands upon METEOR by measuring the semantic similarity between BERT-embedding representations of machine translations and reference translations. Because this measure does not rely on the exact words in the reference translation but on their semantic representation in a latent space, it gives credit to semantically equivalent wording and captures semantic drifts due to word reordering. BERTScore is perhaps the most robust of all reference similarity metrics.


Range of values: 0 - 100


Direction of improvement: Higher is better.
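
A minimal sketch using the bert-score package follows; the underlying BERT model is downloaded on first use, and the sentences are illustrative only.

```python
# A minimal sketch of BERTScore with the bert-score package.
from bert_score import score

candidates = ["the imported goods were expensive"]
references = ["the foreign goods were expensive"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")  # often reported multiplied by 100
```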


Post-editing effort metrics: These metrics attempt to estimate the amount of human post-editing required to turn machine translation output into a “perfect” translation. TER is constrained by the surface form of the reference translation, while COMET_HTER attempts to work at the level of semantics. COMET_HTER is the more reliable measure of the two.


1. Translation Error Rate (TER)

Summary: Computes required post-editing effort to match a reference translation - lower is better - less reliable.


Description: This is a token-based metric that measures the post-editing effort (word insertions, deletions, shifts, and substitutions) required to convert machine-translated text into the reference text (better when closer to 0). Some of the shortcomings of BLEU persist for TER as well.


Range of values: >=0%


Direction of improvement: Lower is better - Scores >30% are not recommended
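
For illustration, TER can be computed with recent versions of sacrebleu, using the same data layout as the BLEU and chrF sketches above.

```python
# A minimal sketch of TER with sacrebleu.
import sacrebleu

hypotheses = ["the cat sat on mat the"]
references = [["the cat sat on the mat"]]  # one reference stream

ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"TER: {ter.score:.2f}")  # percentage of edits needed; lower is better
```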


2. COMET Human-mediated Translation Edit Rate (COMET_HTER)

Summary: Computes required post-editing effort to match the semantics of a reference translation - lower is better - moderately reliable.


Description: This is a deep learning-based metric trained to measure the post-editing effort required to produce an adequate translation from the machine translation output. The difference between TER and COMET_HTER is akin to the difference between BLEU and BERTScore; for example, COMET_HTER is less affected by semantically equivalent substitutions.


Range of values: >=0%


Direction of improvement: Lower is better. If a model scores close to 0, its translations require little or no post-editing to be correct.


Based on results of the participating systems in the WMT-2020 competition:

  • The median system score is 0.126
  • Systems with score < 0.070 are in the top 25%
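
The sketch below uses the open-source Unbabel COMET library with an HTER-trained checkpoint; the checkpoint identifier is a placeholder rather than a confirmed model name, so consult the COMET model list for the HTER-trained model available in your version.

```python
# A minimal sketch of scoring with an HTER-trained COMET checkpoint.
from comet import download_model, load_from_checkpoint

# Placeholder identifier: substitute the actual HTER-trained COMET checkpoint
# available in your installation.
model_path = download_model("PLACEHOLDER-hter-checkpoint")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt.",      # source sentence
    "mt":  "The dog is barking.",  # machine translation output
    "ref": "The dog barks.",       # reference translation
}]

# Depending on the library version, predict() returns an object with .scores
# and .system_score, or a (scores, system_score) tuple.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # estimated post-editing effort per segment; lower is better
```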

Human evaluation estimation metrics: These metrics attempt to learn the score that a human evaluator would have assigned to machine translation output. They are generally considered among the most robust measures of machine translation quality.


1. COMET Direct Assessment (COMET_DA)

Summary: Estimates human evaluation scores based on a reference translation - higher is better - highly reliable.


Description: This is a deep learning-based metric trained on human direct assessments of machine-translated texts.


Range of values: Unlike BLEU and METEOR, COMET_DA is not bounded between 0 and 1


Direction of improvement: Higher is better.


Based on results of the participating systems in the WMT-2020 competition:

  • The median system score is 0.416
  • Systems with score < 0.314 are in the bottom 25%
  • Systems with score > 0.582837 are in the top 25%
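
A minimal sketch with the Unbabel COMET library follows; the checkpoint shown ("Unbabel/wmt22-comet-da") is an assumption on our part: it is newer than the WMT-2020 systems quoted above but is trained on the same kind of direct-assessment data.

```python
# A minimal sketch of scoring with a direct-assessment COMET checkpoint.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed checkpoint name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt.",      # source sentence
    "mt":  "The dog is barking.",  # machine translation output
    "ref": "The dog barks.",       # reference translation
}]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score; higher is better
```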


2. BLEURT

Summary: Estimates human evaluation scores based on a reference translation - higher is better - highly reliable.


Description: This is a neural-based metric that aims to predict the human score of a machine translation output. It utilizes a pre-trained transformer that is further trained and fine-tuned on both synthetic and real data.


Range of values: 0-1


Direction of improvement: Higher is better.
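
A minimal sketch using the Hugging Face `evaluate` wrapper around BLEURT is shown below; it assumes the BLEURT package (from Google Research's repository) is installed and downloads a default checkpoint on first use.

```python
# A minimal sketch of BLEURT via the Hugging Face evaluate library.
import evaluate

bleurt = evaluate.load("bleurt")
results = bleurt.compute(
    predictions=["the dog is barking"],
    references=["the dog barks"],
)
print(results["scores"])  # one score per sentence pair; higher is better
```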


Reference-less metrics: These metrics attempt to evaluate machine translation output against the source text, in the absence of a ground-truth reference translation. They are considered less reliable than metrics that utilize reference translations.


1. Cross Lingual Semantic Similarity Score (CLSS) - also known as MTQuality

Summary: Estimates human evaluation scores without using a reference translation - higher is better - less reliable.


Description: CLSS measures the cosine similarity between the embeddings of the input sentence and its translation.


Range of values: 0 - 1 or 0% - 100%


Direction of improvement: Higher is better.
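
The sketch below illustrates the general idea only and is not the exact CLSS/MTQuality implementation: it approximates cross-lingual similarity with cosine similarity between LaBSE sentence embeddings via the sentence-transformers library.

```python
# An illustrative approximation of cross-lingual semantic similarity.
from sentence_transformers import SentenceTransformer, util

# Approximation only: LaBSE stands in for whatever embedding model CLSS uses.
model = SentenceTransformer("sentence-transformers/LaBSE")

source = "Der Hund bellt."           # source sentence
translation = "The dog is barking."  # machine translation output

embeddings = model.encode([source, translation])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(similarity)  # roughly 0-1; higher suggests the meaning is preserved
```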


2. COMET Quality Estimation (COMET_QE)

Summary: Estimates human evaluation scores without using a reference translation - higher is better - less reliable.


Description: This is another reference-less neural-based metric trained to evaluate the adequacy of the translation with respect to the source text.


Range of values: 0 - 1 or 0% - 100%


Direction of improvement: Higher is better.
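
A minimal sketch with a reference-free COMET quality-estimation checkpoint follows; the model identifier below is an assumption, so check the COMET model list for the QE checkpoint available in your setup.

```python
# A minimal sketch of reference-free quality estimation with COMET.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt20-comet-qe-da")  # assumed QE checkpoint name
model = load_from_checkpoint(model_path)

# QE models need only the source and the machine translation, no reference.
data = [{"src": "Der Hund bellt.", "mt": "The dog is barking."}]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # higher is better
```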


3. Exact Match (works with functions whose output modality is text)


Summary: A given predicted string's exact match score is 1 if it is exactly the same as its reference string, and 0 otherwise.


Description: If the characters of the model's prediction exactly match the characters of (one of) the true answer(s), the score is 1; otherwise, the score is 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0.


Range of values: 0 or 1


Direction of improvement: Higher is better.
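
Exact match needs no external library; a minimal sketch in plain Python, allowing several acceptable reference answers, is shown below.

```python
def exact_match(prediction: str, references: list[str]) -> int:
    """Return 1 if the prediction exactly matches any reference string, else 0."""
    return int(any(prediction == ref for ref in references))

print(exact_match("Paris", ["Paris"]))  # 1
print(exact_match("paris", ["Paris"]))  # 0: off by a single character
```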
