LLM-as-judge

LLM-as-judge — An evaluation method in which a language model is used to score, rank, or compare other model outputs as a substitute for human annotation.

Overview

LLM-as-judge is an evaluation methodology in which a capable language model (the "judge") is prompted to assess the quality of outputs produced by a model under evaluation. The judge may be asked to score a response against a rubric, choose the better of two responses (pairwise comparison), or rate specific quality dimensions (accuracy, helpfulness, safety). The methodology is used to scale evaluation beyond what human annotation can practically cover, particularly for open-ended generation tasks where reference-matching metrics (BLEU, ROUGE, exact match) are inadequate.

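LLM-as-judge was popularized by the MT-Bench framework and the accompanying study, which compared LLM judge verdicts with human preferences collected on MT-Bench and the LMSYS Chatbot Arena (Zheng et al., 2023).[1]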

How it works

A typical LLM-as-judge evaluation proceeds in three steps:

1. Formulate an evaluation prompt specifying the rubric, the original query, and the response(s) to evaluate.
2. Have the judge model output a score (e.g., 1–10) or a preference (A/B), optionally with its reasoning.
3. Aggregate the judge's outputs across a test set to produce comparative metrics.
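The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the prompt wording, the "Score: <n>" output convention, and the names build_prompt, parse_score, evaluate, and fake_judge are all assumptions made for the example. The judge is passed in as a plain callable so that any model client can be substituted.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical prompt template; the wording and output convention are assumptions.
PROMPT_TEMPLATE = """You are an impartial judge. Rate the response to the query
below on a scale of 1 to 10 for {rubric}.
Explain your reasoning briefly, then end with a line of the form "Score: <n>".

[Query]
{query}

[Response]
{response}
"""


def build_prompt(rubric: str, query: str, response: str) -> str:
    """Step 1: combine rubric, query, and response into one evaluation prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, query=query, response=response)


def parse_score(judge_output: str, lo: int = 1, hi: int = 10) -> Optional[int]:
    """Step 2: extract the last "Score: <n>" from the judge's text.

    Returns None if no score is found or the value is outside [lo, hi].
    """
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if lo <= score <= hi else None


def evaluate(
    examples: Iterable[Tuple[str, str]],
    call_judge: Callable[[str], str],
    rubric: str = "helpfulness and accuracy",
) -> Optional[float]:
    """Step 3: average the parsed scores over (query, response) pairs.

    Unparseable judge outputs are skipped rather than counted as zero.
    """
    scores = []
    for query, response in examples:
        score = parse_score(call_judge(build_prompt(rubric, query, response)))
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "The answer is correct but terse.\nScore: 7"


average = evaluate([("What is 2 + 2?", "4")], fake_judge)
```

In practice it is worth tracking how many judge outputs failed to parse, since a high skip rate can silently distort the aggregate.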

Variants:

Absolute scoring: judge rates a single response on a scale.
Pairwise comparison: judge selects the better of two responses ("A is better / B is better / tie").
Rubric-based: structured criteria (correctness, completeness, fluency) scored separately.
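For the pairwise variant, verdicts are usually aggregated into a per-model win rate. The sketch below shows one way to do this; the function name win_rates, the tuple layout, and the choice to count a tie as half a win for each side are assumptions for illustration, and other conventions (such as discarding ties) are equally possible.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

VALID_OUTCOMES = {"A", "B", "tie"}


def win_rates(verdicts: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute each model's win rate from pairwise judge verdicts.

    Each verdict is (model_shown_as_A, model_shown_as_B, outcome), where
    outcome is "A", "B", or "tie". A tie counts as half a win for both
    models (an assumed convention).
    """
    wins: Counter = Counter()
    games: Counter = Counter()
    for model_a, model_b, outcome in verdicts:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unexpected outcome: {outcome!r}")
        games[model_a] += 1
        games[model_b] += 1
        if outcome == "A":
            wins[model_a] += 1
        elif outcome == "B":
            wins[model_b] += 1
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


# Hypothetical verdicts for two models over four comparisons.
rates = win_rates([
    ("m1", "m2", "A"),
    ("m1", "m2", "A"),
    ("m1", "m2", "tie"),
    ("m1", "m2", "B"),
])
```

Absolute and rubric-based scores can instead be averaged per model, or per criterion, without any pairing step.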

Known failure modes

Failure mode	Description
Position bias	Judge favors responses in a fixed position (e.g., always preferring option A)
Verbosity bias	Judge rates longer responses higher regardless of content quality
Self-preference bias	A model used as its own judge may rate its own outputs more favorably
Inconsistency	Same judge gives different scores for identical inputs on repeated runs

Mitigation strategies include randomizing presentation order (against position bias), eliciting chain-of-thought reasoning before the score, averaging across multiple independent judge calls (against inconsistency), and cross-validating against human annotation on a calibration subset.
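A common way to control for presentation order in pairwise comparison is to query the judge twice with the responses swapped and accept a preference only when both orderings agree. The sketch below illustrates this under stated assumptions: the judge is a hypothetical callable taking (query, first, second) and returning "A" (first is better), "B" (second is better), or "tie", and treating inconsistent verdicts as a tie is one conservative convention, not the only option.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (query, first, second) -> "A" | "B" | "tie"


def debiased_preference(query: str, resp_1: str, resp_2: str, judge: Judge) -> str:
    """Return "resp_1", "resp_2", or "tie" after judging in both orders."""
    forward = judge(query, resp_1, resp_2)   # resp_1 shown in position A
    backward = judge(query, resp_2, resp_1)  # resp_2 shown in position A

    # Translate position labels into response labels for each ordering.
    forward_pick = {"A": "resp_1", "B": "resp_2", "tie": "tie"}[forward]
    backward_pick = {"A": "resp_2", "B": "resp_1", "tie": "tie"}[backward]

    # Accept a preference only if it survives the swap; otherwise call it a tie.
    return forward_pick if forward_pick == backward_pick else "tie"


def always_first(query: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers position A."""
    return "A"


def prefers_longer(query: str, first: str, second: str) -> str:
    """Stub judge whose verdict depends on content (length), not position."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


biased_result = debiased_preference("q", "x", "y", always_first)
stable_result = debiased_preference("q", "short", "a much longer answer", prefers_longer)
```

With the position-biased stub, the two orderings disagree and the result collapses to a tie; with the content-dependent stub, the same response wins in both orderings. Note that the second stub is itself an example of verbosity bias, which order swapping does not address.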

Distinction from related evaluation methods

Method	Key characteristic
Human annotation	Ground truth but expensive and slow to scale
Automated metrics (BLEU, ROUGE)	Fast but poorly correlated with quality on open-ended tasks
Retrieval evaluation	Evaluates retrieval quality, not generation quality
Faithfulness metrics	Domain-specific: measures source adherence, not general quality

LLM-as-judge is most valuable for tasks where human annotation is the gold standard but impractical at scale, and where automated metrics do not correlate with human preference.
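Whether a judge tracks human preference well enough to stand in for it can be checked on a small human-labeled calibration subset. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between judge and human labels; the function name and the example labels are hypothetical, and what counts as an acceptable level of agreement is left to the evaluator.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def agreement_and_kappa(
    judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for paired labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge_labels)

    # Fraction of items on which judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)


# Hypothetical pairwise verdicts on four calibration items.
observed, kappa = agreement_and_kappa(
    ["A", "A", "B", "B"],
    ["A", "A", "B", "A"],
)
```

Here the judge matches the human on three of four items (0.75 agreement), but because half of that agreement would be expected by chance, kappa is 0.5.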

References

[1] Zheng, Lianmin et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685
