BLEU score

BLEU score — Precision-based metric measuring n-gram overlap between generated text and reference translations.

Overview

BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric designed to assess the quality of machine-generated text by comparing it against one or more human reference translations. Originally developed for machine translation, BLEU measures the degree to which generated output shares n-gram sequences (contiguous word sequences) with reference text, producing a score between 0 and 1 that is often reported scaled to the range 0 to 100.^[1]

BLEU operates as a precision-oriented metric, prioritizing whether generated tokens appear in reference material over measuring recall or semantic equivalence. It has become a standard benchmark across machine translation, text summarization, and other sequence generation tasks, despite documented limitations in capturing semantic meaning or fluency. The metric is language-agnostic and requires no human annotation beyond the reference texts.

In the context of LLM evaluation, BLEU remains widely reported in academic papers and model cards, though it is increasingly supplemented by learned metrics and LLM-as-judge approaches that better correlate with human judgment. Understanding BLEU's mechanics and constraints is essential for interpreting published model performance claims.

How it is measured

BLEU calculates the geometric mean of modified n-gram precisions across n-gram orders (typically 1 to 4), multiplied by a brevity penalty that penalizes generated text shorter than reference text.

The core precision calculation for n-grams is:

<math>p_n = \frac{\sum_{C \in \{candidate\ sentences\}} \sum_{n\text{-gram} \in C} \min(\text{count}_{\text{candidate}}(n\text{-gram}), \text{count}_{\text{reference}}(n\text{-gram}))}{\sum_{C \in \{candidate\ sentences\}} \sum_{n\text{-gram} \in C} \text{count}_{\text{candidate}}(n\text{-gram})}</math>

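where the reference count for an n-gram is the maximum count of that n-gram in any single reference translation, so a candidate cannot earn credit for repeating an n-gram more often than some reference does. The brevity penalty is applied as: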
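The clipping rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `ngrams` and `clipped_counts` are hypothetical, and tokenization is assumed to be simple whitespace splitting.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one sentence."""
    cand_counts = Counter(ngrams(candidate, n))
    # The clip limit for each n-gram is its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matches = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matches, sum(cand_counts.values())

# A degenerate candidate that repeats one common word seven times.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(clipped_counts(candidate, references, 1))  # (2, 7)
```

Without clipping, all seven occurrences of "the" would count as matches. With clipping, only two do, because no single reference contains "the" more than twice.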

<math>BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}</math>

where c is the length of the candidate translation and r is the effective reference length.
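As a small sketch of the piecewise definition above, the brevity penalty can be written as a function of the two lengths. The function name `brevity_penalty` is hypothetical, and returning 0 for an empty candidate is an assumption made here to avoid division by zero.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c); c and r are token counts."""
    if c == 0:
        return 0.0  # assumption: an empty candidate receives no credit
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(round(brevity_penalty(6, 7), 4))  # 0.8465
print(brevity_penalty(8, 7))            # 1.0
```

A candidate one token shorter than a seven-token reference loses roughly 15% of its score, while a longer candidate is not penalized by this term at all (excess length is instead reflected in lower precision).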

The final BLEU score is:

<math>BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)</math>

where weights w_n are typically uniform (0.25 for n = 1, 2, 3, 4).
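The combination step can be illustrated as follows. This is a sketch under stated assumptions: the function name `combine` and the input precisions are made up for illustration, and returning 0 when any precision is 0 reflects the unsmoothed formula (the logarithm of 0 is undefined), whereas practical implementations often apply smoothing instead.

```python
import math

def combine(precisions, bp, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights by default."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0  # unsmoothed: one zero precision zeroes the geometric mean
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)

# Hypothetical precisions for n = 1..4 with no brevity penalty.
print(round(combine([0.8, 0.6, 0.4, 0.2], bp=1.0), 4))  # 0.4427
```

Because the mean is geometric, a weak higher-order precision pulls the score down much more than it would under an arithmetic mean (which would give 0.5 here).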

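When multiple reference translations are available, each candidate n-gram count is clipped to the largest count of that n-gram in any single reference, and the effective reference length r is taken from the reference whose length is closest to the candidate's. Implementations are available in sacrebleu and other standard libraries; exact numerical results can vary with tokenization and preprocessing choices, so scores computed under different settings are not directly comparable.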
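The choice of effective reference length can be sketched as below. The helper names are hypothetical, and breaking ties in favor of the shorter reference is an assumption of this sketch; libraries may handle ties differently.

```python
def closest_ref_length(candidate_len, ref_lens):
    """Pick the reference length nearest the candidate; ties go to the shorter."""
    return min(ref_lens, key=lambda r: (abs(r - candidate_len), r))

def effective_lengths(candidates, reference_sets):
    """Sum candidate lengths (c) and closest reference lengths (r) over a corpus."""
    c = sum(len(cand) for cand in candidates)
    r = sum(closest_ref_length(len(cand), [len(ref) for ref in refs])
            for cand, refs in zip(candidates, reference_sets))
    return c, r

print(closest_ref_length(6, [7, 5, 10]))  # 5

# Two candidate sentences, each with two references (toy tokens).
candidates = ["a b c".split(), "d e f g h".split()]
reference_sets = [["a b c d".split(), "a b".split()],
                  ["d e f g".split(), "d e f g h i j".split()]]
print(effective_lengths(candidates, reference_sets))  # (8, 6)
```

The summed lengths c and r are what feed into the brevity penalty when BLEU is computed over a whole test set rather than sentence by sentence.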

Distinction from related terms

Term	Distinction
ROUGE	ROUGE emphasizes recall of n-grams (and subsequences/spans) in reference text, whereas BLEU emphasizes precision. ROUGE is standard for summarization; BLEU for translation.
Precision and recall	BLEU is a single precision-focused score computed over n-grams of generated text. Precision and recall are two separate, general-purpose metrics commonly used in classification and information retrieval.
LLM-as-judge	BLEU is a fixed statistical metric requiring no neural computation. LLM-as-judge uses a model to score generated text, often correlating better with human preference but requiring additional inference cost.
Factual consistency or Hallucination measures	BLEU measures surface-level n-gram overlap and does not assess whether generated content is factually accurate or hallucinatory relative to reference material.
Semantic similarity	BLEU is lexical and discrete. Semantic similarity metrics (e.g., BERTScore, embedding-based measures) capture paraphrases and synonyms that BLEU ignores.
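The precision versus recall distinction in the table can be made concrete with a unigram-only sketch. The sentence pair is a made-up example, and the recall figure is only a simplified stand-in for ROUGE-1 recall, not a full ROUGE implementation.

```python
from collections import Counter

candidate = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()

# Counter intersection keeps the minimum count per token (clipped overlap).
overlap = sum((Counter(candidate) & Counter(reference)).values())

precision = overlap / len(candidate)  # BLEU-style: share of candidate tokens matched
recall = overlap / len(reference)     # ROUGE-style: share of reference tokens covered

print(overlap, round(precision, 3), round(recall, 3))  # 5 0.833 0.714
```

The same overlap count yields different scores depending on the denominator: a short candidate tends to look better on precision, which is why BLEU adds a brevity penalty.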

Examples

In the 2020 WMT (Workshop on Machine Translation) shared task, participating systems reported BLEU scores of roughly 25 to 35 (on the 0 to 100 scale) on English-German translation, with higher scores indicating greater n-gram overlap with professional human translations.
OpenNMT and other neural machine translation toolkits can compute BLEU (for example with sacrebleu) on validation sets during training and evaluation, and improvement in validation BLEU is a common criterion for early stopping.
A generated text "the cat sat on the mat" compared against the reference "the cat is sitting on the mat" matches 5 of 6 unigrams and 3 of 5 bigrams ("the cat", "on the", "the mat"), but only 1 of 4 trigrams ("on the mat") and no 4-grams. Candidate n-grams containing "sat", such as "sat on the", do not occur in the reference, illustrating BLEU's sensitivity to word choice and order.
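The toy sentence pair in the last example can be checked by counting clipped n-gram matches directly. This is a minimal sketch assuming whitespace tokenization and a single reference; the helper name `ngram_counts` is hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()

results = {}
for n in range(1, 5):
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Clip each candidate count to the reference count (0 if absent).
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    results[n] = (matched, sum(cand.values()))
    print(n, results[n])
# 1 (5, 6)
# 2 (3, 5)
# 3 (1, 4)
# 4 (0, 3)

# The candidate (6 tokens) is shorter than the reference (7 tokens).
bp = math.exp(1 - len(reference) / len(candidate))
print(round(bp, 4))  # 0.8465
```

Since no 4-gram matches, the unsmoothed 4-gram BLEU for this single sentence is 0, which is one reason BLEU is normally computed over a whole corpus or with smoothing at the sentence level.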

References

[1] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002.
