Retrieval precision and recall
Overview
Retrieval precision and retrieval recall are the two fundamental metrics for evaluating a retrieval system, the component of a RAG pipeline that fetches documents from a corpus in response to a query. They originate in classical information retrieval and apply directly to both dense and sparse retrieval in LLM pipelines.
- Precision@K: of the K documents returned, what fraction are relevant?
- Recall@K: of all relevant documents in the corpus, what fraction appear in the top K returned?
Precision and recall typically trade off against each other as K increases: Recall@K can only stay the same or rise as more documents are returned, while Precision@K usually falls because the additional, lower-ranked documents are less likely to be relevant.
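The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the document IDs, and the choice to return 0.0 when a query has no relevant documents are all assumptions made for the example.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k returned documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    # Divides by k even if fewer than k documents were returned,
    # matching the "(relevant in top K) / K" definition.
    return hits / k


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        # Assumed convention: recall is undefined here, so report 0.0.
        # Some evaluations skip such queries instead.
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical ranking and relevance set (assumes no duplicate IDs in the ranking).
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

for k in (1, 3, 5):
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    print(f"K={k}: precision={p:.3f} recall={r:.3f}")
```

On this toy ranking, precision drops from 1.0 at K=1 to 0.4 at K=5, while recall rises from 1/3 to 2/3 and then plateaus because "d8" is never retrieved.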
Notation and variants
| Metric | Formula | Interpretation |
|---|---|---|
| Precision@K | (Relevant documents in top K) / K | How much noise is in the retrieved set? |
| Recall@K | (Relevant documents in top K) / (Total relevant documents) | How much relevant content was missed? |
| Mean Average Precision (MAP) | Mean over all queries of Average Precision (AP), where AP averages the precision values at each rank where a relevant document appears | Single-number, rank-aware summary across a query set |
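| NDCG@K (Normalized Discounted Cumulative Gain at K) | DCG@K / IDCG@K, where DCG@K is the sum over ranks i = 1..K of rel_i / log2(i + 1) and IDCG@K is the DCG@K of the ideal ordering | Accounts for rank order; relevant documents ranked lower contribute less |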
In practice, RAG systems typically optimize Recall@K (to maximize coverage of relevant content) while capping K (so the retrieved text fits the context window) and accepting the resulting loss of precision.
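The rank-aware rows of the table (MAP and NDCG@K) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and sample data are hypothetical, AP is normalized by the total number of relevant documents (a common convention, though not the only one), and NDCG uses the linear-gain form with a log2 discount.

```python
import math
from typing import Dict, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average of precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0  # assumed convention for queries with no relevant documents
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalizing by all relevant documents penalizes ones never retrieved.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of AP over a set of (ranking, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG@K of the ranking divided by DCG@K of the ideal ordering."""

    def dcg(values: Sequence[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical data: two queries with binary relevance.
run_a = (["d3", "d7", "d1", "d9", "d4"], {"d3", "d1", "d8"})
run_b = (["a", "b"], {"b"})
print("AP(run_a):", average_precision(*run_a))
print("MAP:", mean_average_precision([run_a, run_b]))
print("NDCG@3:", ndcg_at_k(run_a[0], {"d3": 1.0, "d1": 1.0, "d8": 1.0}, 3))
```

Passing graded gains (for example 2.0 for highly relevant, 1.0 for partially relevant) to `ndcg_at_k` is what distinguishes NDCG from the binary metrics above.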
Retrieval quality vs. generation quality
Retrieval metrics evaluate the retrieval component only. A high-recall retriever does not guarantee a high-quality generated answer:
- The generator may fail to use retrieved content (faithfulness failure).
- Retrieved content may be accurate but not optimally responsive to the query (relevance failure).
- Retrieved documents may be relevant to the query yet contain incorrect information (corpus quality failure).
End-to-end RAG evaluation must combine retrieval metrics with generation metrics (faithfulness, answer relevance, factual consistency).
Relevance judgments
Precision and recall require a relevance judgment for each document: is this document relevant to this query? Judgments may come from:
- Human annotators building a golden dataset.
- LLM-as-judge prompts.
- Proxy signals (click-through, downstream answer correctness).
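However the judgments are obtained, they usually end up as a mapping from each query to its set of relevant document IDs. The sketch below shows one way such a golden dataset might drive an evaluation loop. Everything named here is hypothetical: `evaluate_retriever`, the `retrieve(query, k)` callable signature, the canned toy retriever, and the document IDs are stand-ins for whatever retriever and dataset are actually in use.

```python
from typing import Callable, Dict, List, Set


def evaluate_retriever(
    retrieve: Callable[[str, int], List[str]],
    golden: Dict[str, Set[str]],
    k: int,
) -> Dict[str, float]:
    """Average Precision@K and Recall@K over every judged query."""
    precisions: List[float] = []
    recalls: List[float] = []
    for query, relevant in golden.items():
        if not relevant:
            continue  # assumed convention: skip queries with no relevant documents
        top_k = retrieve(query, k)[:k]
        hits = len(set(top_k) & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    if not recalls:
        raise ValueError("golden dataset contains no queries with relevant documents")
    n = len(recalls)
    return {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}


# Toy stand-in for a real retriever: returns canned rankings per query.
canned = {
    "reset password": ["kb12", "kb40", "kb07"],
    "refund policy": ["kb03", "kb19", "kb21"],
}


def toy_retrieve(query: str, k: int) -> List[str]:
    return canned.get(query, [])[:k]


# Hypothetical golden dataset: query -> IDs judged relevant.
golden = {
    "reset password": {"kb12", "kb07"},
    "refund policy": {"kb19", "kb55"},
}

scores = evaluate_retriever(toy_retrieve, golden, k=3)
print(scores)
```

The same loop works whether the relevant sets came from human annotators, an LLM judge, or proxy signals; only the way `golden` is built changes.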