Retrieval-augmented generation
Overview
Retrieval-augmented generation (RAG) is a model architecture that augments a language model with a retrieval step: before generating a response, the system queries an external corpus and prepends the most relevant retrieved passages to the model's context. Generation then conditions on both the query and the retrieved content, reducing reliance on parametric (weight-encoded) knowledge and anchoring answers in specific documents.
RAG was formalized by Lewis et al. (2020) in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", which reported that RAG models outperformed parametric-only sequence-to-sequence baselines on open-domain question answering tasks.[1]
In the LLM era, RAG is widely deployed to make conversational AI systems more accurate, easier to update, and more traceable. It is a common architecture for AI answers that cite their sources.
How it works
A standard RAG pipeline:
- Index: a document corpus is chunked, embedded into vectors, and stored in a vector store.
- Retrieve: at query time, the query is embedded and nearest-neighbor search returns the top-K chunks.
- Generate: the LLM receives [system prompt] + [retrieved chunks] + [query] and generates a response.
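The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the names `build_index`, `retrieve`, and `build_prompt` are hypothetical. The actual LLM call is omitted; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (assumption: a real system uses a learned model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Index: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieve: embed the query and return the k most similar chunks (exhaustive search)."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(system_prompt, chunks, query):
    """Generate (input side): system prompt, then retrieved chunks, then the query."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "BM25 is a sparse retrieval function based on term frequency.",
    "Dense retrieval embeds queries and documents into vectors.",
    "The context window limits how much text a model can read.",
]
query = "how does dense retrieval use vectors"

index = build_index(chunks)
top_chunks = retrieve(index, query, k=1)
prompt = build_prompt("Answer using only the context.", top_chunks, query)
print(prompt)
```

Numbering the chunks in the prompt is one simple way to let the model refer back to a specific passage, which supports source attribution.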
Variations include:
- Sparse retrieval (BM25-based) vs dense retrieval (embedding-based) vs hybrid.
- Reranking — a second-pass model reorders retrieved chunks by relevance before generation.
- Iterative/agentic RAG — the model issues multiple retrieval calls, refining based on intermediate output.
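As an illustration of the hybrid variation, the sketch below merges a sparse ranking and a dense ranking with reciprocal rank fusion (RRF). RRF is only one possible way to combine rankings and is an assumption here, not something the pipeline above prescribes. The document IDs, the function name `reciprocal_rank_fusion`, and the constant `k=60` (a commonly used default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a BM25 retriever and an embedding-based retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it does not require the sparse and dense scores to be on a comparable scale. A reranking model, where used, would then reorder the fused list before generation.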
| Term | Relationship to RAG |
|---|---|
| Grounding | The goal RAG achieves; grounding is the outcome, RAG is the architecture |
| Document grounding | A subtype of grounding, often implemented with RAG |
| Fine-tuning | Encodes knowledge into model weights; RAG retrieves at inference time |
| Context window | RAG places retrieved content inside the context window |
| Hallucination | RAG can reduce factuality hallucination but can introduce faithfulness hallucination (output not supported by the retrieved passages) |
RAG is a specific retrieval-then-generate pattern. Grounding is the broader goal of anchoring outputs in sources; RAG is the most common architecture achieving it, but not the only one.
Evaluation
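RAG systems are evaluated along three separately measurable dimensions:
- Retrieval: whether the relevant chunks appear among the top-K results, measured by precision and recall at K.
- Generation (faithfulness): whether the answer's claims are supported by the retrieved passages, measured with faithfulness metrics such as AlignScore or the faithfulness metric in Ragas.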
- Answer relevance: whether the answer addresses the query, regardless of source.
Evaluating only the final answer without decomposing retrieval and generation quality makes diagnosis of failure modes difficult.
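The retrieval side can be scored on its own, independently of the generator. The sketch below computes precision and recall at K for a single query, assuming a hand-labeled set of relevant chunk IDs is available. The IDs and function names are hypothetical, and precision is divided by K itself, which is one common convention.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical retriever output (best first) and gold labels for one query.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
p4 = precision_at_k(retrieved, relevant, k=4)
r4 = recall_at_k(retrieved, relevant, k=4)
print(p3, r3, p4, r4)
```

If recall at K is low, the generator never sees the needed passage, so a wrong answer points to retrieval. If recall is high but the answer is still unsupported, the problem lies in generation.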
See also
- Grounding vs RAG
- Faithfulness vs Groundedness
- Context window
- Source attribution (AI)
- Hallucination
- Prompt engineering
References
- [1] Lewis, Patrick, et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. https://arxiv.org/abs/2005.11401