LLM-as-judge
Overview
LLM-as-judge is an evaluation methodology in which a capable language model (the "judge") is prompted to assess the quality of outputs produced by a model under evaluation. The judge may be asked to score a response against a rubric, choose the better of two responses (pairwise comparison), or rate specific quality dimensions (accuracy, helpfulness, safety). The methodology scales evaluation beyond what human annotation can practically cover, particularly for open-ended generation tasks where automated string-matching metrics (BLEU, ROUGE, exact match) are inadequate.
LLM-as-judge was popularized by the LMSYS Chatbot Arena and the MT-Bench framework (Zheng et al., 2023).[1]
How it works
A typical LLM-as-judge evaluation proceeds as follows:
- Formulate an evaluation prompt specifying the rubric, the original query, and the response(s) to evaluate.
- Have the judge model output a score (e.g., 1–10) or a preference (A/B), optionally with its reasoning.
- Aggregate the scores across a test set to produce comparative metrics.
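The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the prompt wording, the `Score: <n>` output convention, and the `judge` callable (any function that takes a prompt string and returns the judge model's text) are all assumptions made for the example.

```python
import re
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Hypothetical prompt template; real rubrics are usually far more detailed.
PROMPT_TEMPLATE = """You are an impartial evaluator.
Rubric: {rubric}

[Query]
{query}

[Response]
{response}

Explain your reasoning briefly, then end with a line of the form "Score: <1-10>"."""


def build_prompt(rubric: str, query: str, response: str) -> str:
    """Step 1: combine the rubric, query, and response into one prompt."""
    return PROMPT_TEMPLATE.format(rubric=rubric, query=query, response=response)


def parse_score(judge_output: str) -> Optional[int]:
    """Step 2: extract the last 'Score: n' in the output; None if absent or out of range."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None


def evaluate(
    test_set: List[Tuple[str, str]],
    judge: Callable[[str], str],
    rubric: str,
) -> Optional[float]:
    """Step 3: average the parsed scores over (query, response) pairs."""
    scores = []
    for query, response in test_set:
        score = parse_score(judge(build_prompt(rubric, query, response)))
        if score is not None:  # unparseable outputs are skipped, not counted as zero
            scores.append(score)
    return mean(scores) if scores else None
```

Skipping unparseable outputs is one possible design choice; depending on the use case it may be preferable to retry the call or to report the parse-failure rate alongside the mean.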
Variants:
- Absolute scoring: judge rates a single response on a scale.
- Pairwise comparison: judge selects the better of two responses ("A is better / B is better / tie").
- Rubric-based: structured criteria (correctness, completeness, fluency) scored separately.
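For the pairwise variant, the judge's verdicts are typically reduced to win rates. The sketch below assumes the judge has been instructed to end its output with one of the three verdict phrases listed above; that convention and the function names are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def parse_verdict(judge_output: str) -> Optional[str]:
    """Map the final line of the judge output to 'A', 'B', or 'tie'; None if unrecognized."""
    lines = judge_output.strip().splitlines()
    if not lines:
        return None
    final = lines[-1].strip().lower()
    if final.startswith("a is better"):
        return "A"
    if final.startswith("b is better"):
        return "B"
    if final.startswith("tie"):
        return "tie"
    return None


def win_rates(verdicts: Iterable[Optional[str]]) -> Dict[str, float]:
    """Fraction of valid verdicts going to A, B, and tie; empty dict if none are valid."""
    counts = Counter(v for v in verdicts if v in ("A", "B", "tie"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("A", "B", "tie")}
```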
Known failure modes
| Failure mode | Description |
|---|---|
| Position bias | Judge favors responses in a fixed position (e.g., always preferring option A) |
| Verbosity bias | Judge rates longer responses higher regardless of content quality |
| Self-preference bias | A model used as its own judge may rate its own outputs more favorably |
| Inconsistency | Same judge gives different scores for identical inputs on repeated runs |
Mitigation strategies include randomizing the presentation order of the responses, prompting the judge for chain-of-thought reasoning before the score, averaging across multiple independent judge calls, and validating judge outputs against human annotation on a calibration subset.
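Two of these mitigations, randomized presentation order and aggregation over repeated judge calls, can be combined in a few lines. The sketch below is one possible arrangement under stated assumptions: `judge` is a hypothetical callable that receives the query and the two responses in presentation order and returns `"first"`, `"second"`, or `"tie"`.

```python
import random
from collections import Counter
from typing import Callable


def debiased_preference(
    judge: Callable[[str, str, str], str],
    query: str,
    response_1: str,
    response_2: str,
    n_calls: int = 4,
    seed: int = 0,
) -> Counter:
    """Tally votes for response_1 / response_2 / tie over n_calls with random ordering."""
    rng = random.Random(seed)  # seeded so the evaluation is reproducible
    votes: Counter = Counter()
    for _ in range(n_calls):
        swapped = rng.random() < 0.5
        first, second = (response_2, response_1) if swapped else (response_1, response_2)
        verdict = judge(query, first, second)
        if verdict == "tie":
            votes["tie"] += 1
        elif verdict in ("first", "second"):
            # Map the positional verdict back to the underlying response.
            picked_first = verdict == "first"
            votes["response_1" if picked_first != swapped else "response_2"] += 1
        # Any other output is treated as unparseable and ignored.
    return votes
```

With this arrangement, a judge that always picks the first position splits its votes between the two responses according to the random ordering, whereas a judge that responds to content gives the same response every vote regardless of order.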
Comparison with other evaluation methods
| Method | Key characteristic |
|---|---|
| Human annotation | Ground truth but expensive and slow to scale |
| Automated metrics (BLEU, ROUGE) | Fast but poorly correlated with quality on open-ended tasks |
| Retrieval evaluation | Evaluates retrieval quality, not generation quality |
| Faithfulness metrics | Domain-specific: measures source adherence, not general quality |
LLM-as-judge is most valuable for tasks where human annotation is the gold standard but impractical at scale, and where automated metrics do not correlate with human preference.
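Whether a given judge actually tracks human preference can be checked on a calibration subset labeled by both humans and the judge. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance; the choice of kappa and the function names are assumptions for illustration, not something prescribed by the methodology.

```python
from collections import Counter
from typing import Optional, Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of items on which the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> Optional[float]:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e); None if undefined."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human_counts) | set(judge_counts)
    )
    if p_expected == 1.0:  # both raters used a single identical label throughout
        return None
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For example, with human labels `["A", "A", "B", "B"]` and judge labels `["A", "A", "B", "A"]`, raw agreement is 0.75 and kappa is 0.5.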
See also
- Faithfulness vs Groundedness
- Golden dataset
- Retrieval precision and recall
- Benchmark contamination
- Evaluation
References
- [1] Zheng, Lianmin, et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. https://arxiv.org/abs/2306.05685