BLEU score
Overview
BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric designed to assess the quality of machine-generated text by comparing it against one or more human reference translations. Originally developed for machine translation, BLEU measures the degree to which generated output shares n-gram sequences (contiguous word sequences) with reference text, producing a score between 0 and 1.[1]
BLEU operates as a precision-oriented metric, prioritizing whether generated tokens appear in reference material over measuring recall or semantic equivalence. It has become a standard benchmark across machine translation, text summarization, and other sequence generation tasks, despite documented limitations in capturing semantic meaning or fluency. The metric is language-agnostic in principle, although scores depend on how the text is tokenized, and it requires no human annotation beyond the reference texts.
In the context of LLM evaluation, BLEU remains widely reported in academic papers and model cards, though it is increasingly supplemented by learned metrics and LLM-as-judge approaches that better correlate with human judgment. Understanding BLEU's mechanics and constraints is essential for interpreting published model performance claims.
How it is measured
BLEU is the geometric mean of the modified n-gram precisions for orders 1 through N (typically N = 4), multiplied by a brevity penalty that lowers the score when the generated text is shorter than the reference text.
The core precision calculation for n-grams is:
<math>p_n = \frac{\sum_{C \in \{candidate\ sentences\}} \sum_{n\text{-gram} \in C} \min(\text{count}_{\text{candidate}}(n\text{-gram}), \text{count}_{\text{reference}}(n\text{-gram}))}{\sum_{C \in \{candidate\ sentences\}} \sum_{n\text{-gram} \in C} \text{count}_{\text{candidate}}(n\text{-gram})}</math>
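where the minimum clips each candidate n-gram count to the largest count of that n-gram observed in any single reference translation, so a candidate cannot gain credit by repeating a matching n-gram more often than a reference does. The brevity penalty is applied as: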
<math>BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}</math>
where c is the length of the candidate translation and r is the effective reference length.
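The brevity penalty above can be sketched as a small function. This is an illustrative sketch rather than the code of any particular library; the function name `brevity_penalty` and the handling of an empty candidate are assumptions not specified by the formula.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and effective reference length r."""
    if c == 0:
        # Assumption: an empty candidate receives a penalty of 0
        # (the formula itself is undefined for c = 0).
        return 0.0
    if c > r:
        return 1.0
    # c <= r: the penalty shrinks exponentially as the candidate gets shorter.
    return math.exp(1.0 - r / c)
```

For instance, a 6-token candidate scored against a 7-token reference gives `brevity_penalty(6, 7)`, which equals exp(-1/6), roughly 0.846, while any candidate longer than the reference is not penalized.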
The final BLEU score is:
<math>BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)</math>
where weights w_n are typically uniform (0.25 for n = 1, 2, 3, 4).
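Putting the three formulas together, a minimal corpus-level sketch might look as follows. It assumes pre-tokenized input, exactly one reference per candidate, uniform weights, and no smoothing; the names `ngram_counts` and `corpus_bleu` are hypothetical and do not correspond to any library's API.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Unsmoothed BLEU for lists of token lists (one reference per candidate)."""
    matched = [0] * max_n  # clipped n-gram matches, summed over the corpus
    total = [0] * max_n    # candidate n-gram counts, summed over the corpus
    c_len = r_len = 0
    for cand, ref in zip(candidates, references):
        c_len += len(cand)
        r_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each candidate count to the reference count.
            matched[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in cand_counts.items()
            )
            total[n - 1] += sum(cand_counts.values())
    # Without smoothing, any zero precision makes the geometric mean zero.
    if c_len == 0 or min(matched) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    return bp * math.exp(log_mean)
```

Because the counts are summed over all sentences before the precisions are formed, this is a corpus-level score and not an average of sentence-level scores. Returning 0 when any precision is zero is the unsmoothed behavior; practical implementations often offer smoothing options for short texts.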
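When multiple reference translations are available, each candidate n-gram count is clipped to the largest count of that n-gram in any single reference, and the effective reference length is taken from the reference whose length is closest to the candidate's. Implementations are available in sacrebleu and other standard libraries; exact numerical results can vary with tokenization and preprocessing choices, so scores are only comparable when computed with the same settings.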
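The multi-reference handling can be sketched as two helpers: one that clips candidate counts against the per-n-gram maximum over all references, and one that picks the effective reference length. The function names are hypothetical, and breaking length ties in favor of the shorter reference is an assumption; implementations differ on this detail.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) against several references."""
    # For each n-gram, the largest count seen in any single reference.
    limits = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            limits[gram] = max(limits[gram], count)
    cand_counts = ngram_counts(candidate, n)
    matched = sum(min(count, limits[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


def closest_reference_length(candidate, references):
    """Length of the reference closest in length to the candidate."""
    c = len(candidate)
    # Assumption: ties are broken in favor of the shorter reference.
    return min((len(ref) for ref in references), key=lambda r: (abs(r - c), r))
```

With the degenerate candidate "the the the the the the the" and the references "the cat is on the mat" and "there is a cat on the mat", `clipped_matches(candidate, references, 1)` returns `(2, 7)`: "the" occurs at most twice in a single reference, so only two of the seven candidate unigrams count as matches.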
| Term | Distinction |
|---|---|
| ROUGE | ROUGE emphasizes recall of n-grams (and subsequences/spans) in reference text, whereas BLEU emphasizes precision. ROUGE is standard for summarization; BLEU for translation. |
| Precision and recall | BLEU is a single precision-focused metric applied to entire sequences. Precision and recall are separate metrics commonly used in information retrieval contexts. |
| LLM-as-judge | BLEU is a fixed statistical metric requiring no neural computation. LLM-as-judge uses a model to score generated text, often correlating better with human preference but requiring additional inference cost. |
| Factual consistency or Hallucination measures | BLEU measures surface-level n-gram overlap and does not assess whether generated content is factually accurate or hallucinatory relative to reference material. |
| Semantic similarity | BLEU is lexical and discrete. Semantic similarity metrics (e.g., BERTScore, embedding-based measures) capture paraphrases and synonyms that BLEU ignores. |
Examples
- In the 2020 WMT (Workshop on Machine Translation) shared task, participating systems reported BLEU scores ranging from 25 to 35 on English-German translation (on the 0–100 scale commonly used for reporting, equivalent to 0.25–0.35 on the 0–1 scale), with higher scores indicating greater n-gram overlap with professional human translations.
- Neural machine translation toolkits such as OpenNMT can use sacrebleu to compute BLEU during training and evaluation, for example to trigger early stopping when BLEU on a validation set stops improving.
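- A generated text "the cat sat on the mat" compared against the reference "the cat is sitting on the mat" matches 5 of its 6 unigrams and 3 of its 5 bigrams ("the cat", "on the", "the mat"), but only 1 of its 4 trigrams ("on the mat") and none of its 4-grams, because "sat" never appears in the reference. This illustrates BLEU's sensitivity to word choice and order: without smoothing, the zero 4-gram precision drives the sentence-level BLEU-4 score to 0.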
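The overlap in the last example can be checked by counting n-grams directly. The snippet below is a small illustrative sketch that assumes whitespace tokenization and a single reference; `modified_precision` is a hypothetical helper, not a library function.

```python
from collections import Counter
from fractions import Fraction


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngram_counts(candidate, n)
    ref_counts = ngram_counts(reference, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return Fraction(matched, sum(cand_counts.values()))


candidate = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()

precisions = [modified_precision(candidate, reference, n) for n in range(1, 5)]
print([str(p) for p in precisions])  # ['5/6', '3/5', '1/4', '0']
```

The single substituted word "sat" removes one unigram match but breaks two bigrams, three trigrams, and every 4-gram, which is why higher-order precisions fall off so quickly.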
See also
- Large language model — foundational technology generating text evaluated by BLEU and other metrics
- Foundation model — base models whose translation or text generation capabilities are commonly benchmarked with BLEU
- Benchmark contamination — risk that BLEU test sets appear in pretraining data, inflating apparent performance
- LLM-as-judge — modern alternative evaluation approach addressing BLEU's limitations in semantic correlation
- Instruction tuning — technique for training models to improve downstream metrics including BLEU
References
- [1] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. 2002.