LLM-as-judge

LLM-as-judge — An evaluation method in which a language model is used to score, rank, or compare other model outputs as a substitute for human annotation.

Overview

LLM-as-judge is an evaluation methodology in which a capable language model (the "judge") is prompted to assess the quality of outputs produced by a model under evaluation. The judge may be asked to score a response against a rubric, choose the better of two responses (pairwise), or rate specific quality dimensions (accuracy, helpfulness, safety). The methodology is used to scale evaluation beyond what human annotation can practically cover, particularly for open-ended generation tasks where automated string-matching metrics (BLEU, ROUGE, exact-match) are inadequate.

LLM-as-judge was popularized by the LMSYS Chatbot Arena and the MT-Bench framework (Zheng et al., 2023).[1]

How it works

A typical LLM-as-judge evaluation proceeds as follows:

  1. Formulate an evaluation prompt specifying the rubric, the original query, and the response(s) to evaluate.
  2. Send the prompt to the judge model, which returns a score (e.g., 1–10) or a preference (A/B), optionally with reasoning.
  3. Aggregate the scores or preferences across a test set to produce comparative metrics.
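
The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the prompt wording, the "Score: N" output convention, and the names build_prompt, parse_score, and evaluate are all hypothetical, and the judge is any caller-supplied function that maps a prompt string to the judge model's text output.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = """You are evaluating a response to a user query.
Rubric: {rubric}

Query: {query}
Response: {response}

Explain your reasoning briefly, then end with a line of the form "Score: <1-10>"."""


def build_prompt(rubric: str, query: str, response: str) -> str:
    """Step 1: combine rubric, query, and response into one evaluation prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, query=query, response=response)


def parse_score(judge_output: str) -> Optional[int]:
    """Step 2: return the last 'Score: N' if N is in 1..10, else None."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def evaluate(
    judge: Callable[[str], str],
    rubric: str,
    examples: list[tuple[str, str]],
) -> dict:
    """Step 3: score every (query, response) pair and aggregate."""
    scores = []
    unparsed = 0
    for query, response in examples:
        score = parse_score(judge(build_prompt(rubric, query, response)))
        if score is None:
            unparsed += 1  # Track outputs that did not follow the format.
        else:
            scores.append(score)
    return {
        "mean_score": mean(scores) if scores else None,
        "n_scored": len(scores),
        "n_unparsed": unparsed,
    }
```

Counting unparsed outputs separately, rather than silently dropping them, makes it visible when the judge is not following the requested output format.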

Variants:

  • Absolute scoring: judge rates a single response on a scale.
  • Pairwise comparison: judge selects the better of two responses ("A is better / B is better / tie").
  • Rubric-based: structured criteria (correctness, completeness, fluency) scored separately.
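
For the pairwise variant, the judge's verdicts are usually reduced to a win rate. The sketch below assumes the judge was instructed to end its output with "Verdict: A", "Verdict: B", or "Verdict: tie"; that convention, the function names, and the choice to count a tie as half a win are illustrative assumptions rather than a fixed standard.

```python
import re
from collections import Counter
from typing import Optional


def parse_verdict(judge_output: str) -> Optional[str]:
    """Extract 'A', 'B', or 'tie' from the judge output, or None if absent."""
    match = re.search(r"Verdict:\s*(A|B|[Tt]ie)\b", judge_output)
    if match is None:
        return None
    token = match.group(1)
    return "tie" if token.lower() == "tie" else token


def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons won by A, counting each tie as half a win."""
    counts = Counter(verdicts)
    total = counts["A"] + counts["B"] + counts["tie"]
    if total == 0:
        raise ValueError("no valid verdicts to aggregate")
    return (counts["A"] + 0.5 * counts["tie"]) / total
```

For example, win_rate(["A", "A", "B", "tie"]) evaluates to 0.625: two wins plus half a win for the tie, over four comparisons.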

Known failure modes

Failure mode          Description
--------------------  ------------------------------------------------------------------------------
Position bias         Judge favors responses in a fixed position (e.g., always preferring option A)
Verbosity bias        Judge rates longer responses higher regardless of content quality
Self-preference bias  A model used as its own judge may rate its own outputs more favorably
Inconsistency         Same judge gives different scores for identical inputs on repeated runs

Mitigation strategies include: randomizing presentation order (against position bias), having the judge produce chain-of-thought reasoning before the score, averaging across multiple independent judge calls (against inconsistency), and validating judge scores against human annotation on a calibration subset.
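
Some of these mitigations can be sketched in a few lines of Python. Note the assumptions: instead of randomizing the order once, debiased_preference evaluates both presentation orders and keeps a verdict only when the two agree, which is a deterministic alternative to randomization rather than the same technique. The judge interface (a function returning "first", "second", or "tie") and all function names are hypothetical.

```python
from statistics import mean
from typing import Callable

# Hypothetical judge interface: (query, first_response, second_response)
# -> "first", "second", or "tie".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(
    judge: PairwiseJudge, query: str, resp_a: str, resp_b: str
) -> str:
    """Judge both presentation orders; return 'A' or 'B' only if they agree."""
    forward = judge(query, resp_a, resp_b)
    backward = judge(query, resp_b, resp_a)
    # Map position-based verdicts back to response identities.
    forward_pick = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_pick = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return forward_pick if forward_pick == backward_pick else "tie"


def averaged_score(score_once: Callable[[], float], n_calls: int = 3) -> float:
    """Average several independent judge calls to damp run-to-run variance."""
    return mean(score_once() for _ in range(n_calls))


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of calibration items where the judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

A judge that always answers "first" yields "tie" from debiased_preference, so pure position bias cannot produce a win. Raw agreement is the simplest calibration measure; a chance-corrected statistic may be preferable when one label dominates.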

Distinction from related evaluation methods

Method                           Key characteristic
-------------------------------  ----------------------------------------------------------------
Human annotation                 Ground truth but expensive and slow to scale
Automated metrics (BLEU, ROUGE)  Fast but poorly correlated with quality on open-ended tasks
Retrieval eval                   Evaluates retrieval quality, not generation quality
Faithfulness metrics             Domain-specific: measures source adherence, not general quality

LLM-as-judge is most valuable for tasks where human annotation is the gold standard but impractical at scale, and where automated metrics do not correlate with human preference.

References

  1. Zheng, Lianmin et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685