Inter-annotator agreement

From llmref.wiki
Inter-annotator agreement — Quantitative measure of consistency between two or more independent human raters assigning categorical or ordinal judgments to the same items.

Overview

Inter-annotator agreement (IAA) is a statistical framework for evaluating the reliability of human annotation in datasets used for LLM training, evaluation, and quality assurance. When multiple annotators label the same set of items (for example relevance judgments, semantic categories, or content labels), their decisions inevitably vary due to subjectivity, unclear guidelines, fatigue, or genuine semantic ambiguity. IAA quantifies this disagreement and so indicates how much confidence downstream tasks can place in those annotations.

In the context of automated evaluation and human evaluation of LLM outputs, IAA serves as a validity check. High agreement signals that the annotation task is well-defined and that resulting labels constitute reliable ground truth. Low agreement indicates that either the task definition requires refinement or the phenomenon being annotated is genuinely ambiguous—both findings have operational consequences for model development and benchmarking.

IAA is particularly relevant in benchmark construction, safety evaluation, and bias detection. Annotations with established inter-rater reliability are generally considered better suited for use as preference data in RLHF, as instruction-tuning data, and as the human-written references against which reference-based metrics such as ROUGE and BLEU are computed in comparative studies.

How it is measured

Inter-annotator agreement is usually measured with coefficients, most of which correct raw agreement for the agreement expected by chance. The choice of coefficient depends on the data type and number of raters:

  • Cohen's kappa (κ): For two raters and categorical judgments. Defined as (P_o − P_e) / (1 − P_e), where P_o is observed agreement and P_e is expected agreement by chance. Values range from −1 to 1; κ > 0.6 is typically considered acceptable, κ > 0.8 excellent.
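  • Fleiss' kappa: For three or more raters and categorical (nominal) data. It averages, over items, the proportion of rater pairs that agree and corrects for chance using the pooled category frequencies. The raters need not be the same individuals for every item, but each item must receive the same number of ratings; when the number of ratings per item varies, Krippendorff's alpha is the usual alternative.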
  • Krippendorff's alpha (α): A generalized agreement statistic applicable to multiple raters, missing data, and various measurement scales (nominal, ordinal, interval, ratio). Increasingly preferred in computational linguistics and annotation science for its robustness.
  • Intraclass Correlation Coefficient (ICC): For continuous scores (e.g., rating scales) and multiple raters. Used when judgments are interval or ratio-scaled rather than categorical.
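As a rough illustration of the two kappa coefficients listed above, the following is a minimal pure-Python sketch, not a reference implementation. The function names `cohen_kappa` and `fleiss_kappa`, the input layouts, and the choice to return NaN when chance agreement equals 1 are assumptions made for this example.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("expected two non-empty label sequences of equal length")
    n = len(labels_a)
    # P_o: share of items on which the two raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P_e: chance agreement implied by each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return math.nan  # kappa is undefined here (assumed convention: NaN)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] holds the labels given to item i.

    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("expected at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreeing = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    # Chance agreement from the pooled category proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return math.nan
    return (p_bar - p_e) / (1 - p_e)


rater_1 = ["rel", "rel", "not", "not"]
rater_2 = ["rel", "not", "not", "not"]
print(cohen_kappa(rater_1, rater_2))  # 0.5
```

In the toy call, observed agreement is 3/4 and chance agreement is 1/2, so the result is (0.75 − 0.5) / (1 − 0.5) = 0.5. For production use, an established statistics library is probably a safer choice than hand-written code.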

In practice, researchers typically report the coefficient together with a confidence interval, commonly at the 95% level. Disagreements are often analyzed qualitatively as well: raters may disagree systematically on specific item types, revealing ambiguity in the annotation guidelines that can be resolved through iterative refinement.
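One common way to attach a confidence interval to an agreement coefficient is a percentile bootstrap over items. The sketch below assumes that approach; the text does not say which interval method is used. `bootstrap_kappa_ci`, its parameters, and the fixed seed are hypothetical names and defaults chosen for illustration.

```python
import math
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return math.nan if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(labels_a, labels_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_resamples):
        # Resample item indices so both raters' labels stay paired.
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([labels_a[i] for i in idx], [labels_b[i] for i in idx])
        if not math.isnan(k):  # drop degenerate resamples where kappa is undefined
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined in every resample")
    stats.sort()
    lo_idx = max(0, int((1 - level) / 2 * len(stats)))
    hi_idx = min(len(stats) - 1, int((1 + level) / 2 * len(stats)) - 1)
    return stats[lo_idx], stats[max(lo_idx, hi_idx)]


# Illustrative data only: 300 binary labels with 88% raw agreement.
rater_a = ["sound"] * 150 + ["unsound"] * 150
rater_b = ["sound"] * 132 + ["unsound"] * 18 + ["sound"] * 18 + ["unsound"] * 132
low, high = bootstrap_kappa_ci(rater_a, rater_b)
print(f"kappa = {cohen_kappa(rater_a, rater_b):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Resampling whole items rather than individual labels keeps each pair of judgments together, which is what the interval is meant to reflect. Analytic standard errors exist as well and may be preferable for very small samples.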

Distinction from related terms

Term | Distinction
Human evaluation | Human evaluation is the broader category of assessing LLM outputs using human raters. Inter-annotator agreement is a specific diagnostic metric *about* human evaluation: it measures consistency *between* raters, not the quality of LLM responses themselves.
Automated evaluation | Automated evaluation uses metrics (BLEU, ROUGE, semantic similarity) that are computed without human raters at scoring time. Inter-annotator agreement is exclusively about human raters and their consensus; automated metrics compare outputs against fixed references rather than reconciling the judgments of multiple raters.
Gold-relevance distillation | Gold-relevance distillation aggregates multiple human judgments into a single consensus label, often *after* measuring IAA to ensure reliability. IAA measures raw disagreement; distillation produces the final ground truth.
Factual consistency | Factual consistency is a property of an LLM output (does the response match factual reality?). Inter-annotator agreement is a property of the annotation process itself, independent of whether the phenomenon being annotated is factual or subjective.
Model card | A model card documents model performance and limitations. Inter-annotator agreement is a metadata attribute: it should appear *in* a model card to indicate the reliability of human-annotated evaluation sets, but it is not the card itself.
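To make the contrast between measuring agreement and producing a consensus label concrete, here is a small sketch of one possible aggregation rule, a plain majority vote that leaves ties unresolved. Majority voting is an assumption for illustration; the table does not specify how gold-relevance distillation aggregates judgments, and `majority_label` is a hypothetical helper.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None when there is no unique winner."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # no judgments for this item
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the item for adjudication
    return ranked[0][0]


judgments = {
    "response-1": ["toxic", "toxic", "non-toxic"],
    "response-2": ["toxic", "non-toxic"],
}
consensus = {item: majority_label(votes) for item, votes in judgments.items()}
print(consensus)  # {'response-1': 'toxic', 'response-2': None}
```

Returning `None` on ties keeps contested items visible instead of silently picking a label, which fits the idea that agreement is checked before a final label is trusted.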

Examples

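  • Relevance annotation for retrieval evaluation: A study evaluates retrieval systems by having five annotators independently judge whether retrieved documents are relevant to 100 queries on a 4-point scale (not relevant, somewhat relevant, relevant, highly relevant). Computing Fleiss' kappa across all 500 judgments yields κ = 0.72, indicating substantial agreement. Because Fleiss' kappa treats the four scale points as unordered categories, a disagreement between adjacent points counts as much as one between the two extremes; a coefficient with an ordinal metric, such as Krippendorff's alpha, would give partial credit for near misses. The kappa value is reported alongside the evaluation results to justify confidence in the precision and recall metrics computed from the aggregated labels.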
  • Toxicity classification for safety datasets: Researchers building a safety evaluation set for harmful-content detection have three annotators label 2,000 LLM responses as "toxic" or "non-toxic." Cohen's kappa between annotator pairs ranges from 0.68 to 0.75. The researchers identify disagreement clusters in sarcasm and culturally sensitive language, then revise annotation guidelines and re-annotate; second-round kappa improves to 0.82, warranting use of the refined dataset in model training.
  • Chain-of-thought reasoning evaluation: When assessing whether the reasoning steps in chain-of-thought outputs are logically sound, two expert annotators label 300 reasoning traces on a binary scale. Their observed agreement is 88% and the agreement expected by chance is 50%, yielding Cohen's κ = (0.88 − 0.50) / (1 − 0.50) = 0.76. This substantial kappa suggests the task is well-defined enough for downstream use, though the 12% of traces on which the annotators disagree may still point to edge cases that need clearer criteria.
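The arithmetic in the chain-of-thought example can be checked from a contingency table. The sketch below uses a hypothetical 2×2 table that is consistent with the stated figures (300 traces, 88% observed agreement, 50% chance agreement); the individual cell counts are not given in the example and are assumed here, as is the helper name `kappa_from_table`.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square contingency table.

    table[i][j] counts items that rater A put in category i and rater B in j.
    Returns (observed agreement, chance agreement, kappa).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Assumed counts: both raters split the 300 traces evenly between the labels.
balanced = [[132, 18],
            [18, 132]]
p_o, p_e, kappa = kappa_from_table(balanced)
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))  # 0.88 0.5 0.76

# Same 88% raw agreement, but both raters call 90% of traces "sound".
skewed = [[252, 18],
          [18, 12]]
p_o, p_e, kappa = kappa_from_table(skewed)
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))  # 0.88 0.82 0.33
```

The second table shows why the chance term matters: with heavily skewed label frequencies, the same raw agreement yields a much lower kappa, so observed agreement alone is not enough to judge reliability.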
