Retrieval precision and recall

From llmref.wiki
Retrieval precision and recall — Information-retrieval metrics measuring, respectively, the fraction of retrieved documents that are relevant and the fraction of all relevant documents that are retrieved.

Overview

Retrieval precision and retrieval recall are the two fundamental metrics for evaluating the quality of a retrieval system — the component of a RAG pipeline that fetches documents from a corpus in response to a query. They originate in classical information retrieval and are directly applicable to dense and sparse retrieval in LLM pipelines.

  • Precision@K: of the K documents returned, what fraction are relevant?
  • Recall@K: of all relevant documents in the corpus, what fraction appear in the top K returned?

Precision and recall trade off against each other as K increases: recall can only stay the same or rise with a larger K (more relevant documents retrieved), while precision typically falls (a higher proportion of irrelevant documents is included).
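The two definitions and the effect of K can be illustrated with a minimal Python sketch. The function names, document IDs, and relevance set below are hypothetical and chosen only for illustration. The sketch assumes binary relevance, divides precision by K even when fewer than K documents are returned, and returns a recall of 0.0 for a query with no relevant documents (recall is strictly undefined in that case).

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k returned documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention assumed here; undefined without relevant docs
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranking and relevance judgments, for illustration only.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d2"}

for k in (1, 3, 6):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    print(f"K={k}: precision={p:.2f} recall={r:.2f}")
# K=1: precision=1.00 recall=0.33
# K=3: precision=0.67 recall=0.67
# K=6: precision=0.50 recall=1.00
```

In this toy ranking, recall climbs from 0.33 to 1.00 as K grows from 1 to 6, while precision drops from 1.00 to 0.50.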

Notation and variants

Metric | Formula | Interpretation
Precision@K | (Relevant documents in top K) / K | How much noise is in the retrieved set?
Recall@K | (Relevant documents in top K) / (Total relevant documents) | How much relevant content was missed?
Mean Average Precision (MAP) | Mean over queries of average precision (AP), where AP averages the precision values at each rank where a relevant document appears | Single-number summary balancing both
NDCG@K | DCG@K / ideal DCG@K (Normalized Discounted Cumulative Gain at K) | Accounts for rank order; relevant documents ranked lower contribute less
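The rank-aware metrics in the table can be sketched in Python as follows. This is one common formulation, not the only one: it assumes binary relevance, a log2 rank discount for NDCG, rankings without duplicate document IDs, and an AP that divides by the total number of relevant documents so that unretrieved ones count as zero. All names (`average_precision`, `runs`, `qrels`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant doc,
    divided by the total number of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query AP over every query in `runs`."""
    if not runs:
        return 0.0
    aps = [average_precision(docs, qrels.get(q, set())) for q, docs in runs.items()]
    return sum(aps) / len(aps)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: relevant docs at ranks 1 and 3.
ap = average_precision(["d3", "d7", "d1"], {"d3", "d1"})
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

Graded relevance labels would need a gain term (for example the label value) in place of the constant 1.0 used here.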

In practice, RAG systems typically optimize Recall@K (maximizing coverage of relevant content) while constraining K so that the retrieved text fits in the context window, and accept the resulting loss of precision.
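One simple way to constrain K is to keep adding ranked chunks until a token budget is exhausted. The sketch below assumes per-chunk token counts are already known; the function name, the counts, and the budget are hypothetical, and real pipelines may also reserve room for the prompt and the generated answer.

```python
from typing import Sequence


def largest_k_within_budget(token_counts: Sequence[int], budget: int) -> int:
    """Return the largest K such that the top-K ranked chunks fit in `budget` tokens."""
    used = 0
    k = 0
    for n in token_counts:
        if used + n > budget:
            break  # the next chunk would overflow the budget
        used += n
        k += 1
    return k


# Hypothetical token counts for chunks in ranked order.
token_counts = [300, 450, 200, 600, 150]
print(largest_k_within_budget(token_counts, budget=1000))  # 3
```

Stopping at the first chunk that does not fit preserves rank order; skipping oversized chunks and continuing is an alternative design with different precision and recall behavior.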

Retrieval quality vs. generation quality

Retrieval metrics evaluate the retrieval component only. A high-recall retriever does not guarantee a high-quality generated answer:

  • The generator may fail to use retrieved content (faithfulness failure).
  • Retrieved content may be accurate but not optimally responsive to the query (relevance failure).
  • Relevant documents retrieved may themselves be wrong (corpus quality failure).

End-to-end RAG evaluation must combine retrieval metrics with generation metrics (faithfulness, answer relevance, factual consistency).

Relevance judgments

Precision and recall require a relevance judgment for each document: is this document relevant to this query? Judgments may come from:
