Retrieval-augmented generation

Retrieval-augmented generation — An architecture that retrieves relevant documents at inference time and conditions text generation on their contents to ground answers in sources.

Overview

Retrieval-augmented generation (RAG) is a model architecture that augments a language model with a retrieval step: before generating a response, the system queries an external corpus and prepends the most relevant retrieved passages to the model's context. Generation then conditions on both the query and the retrieved content, reducing reliance on parametric (weight-encoded) knowledge and anchoring answers in specific documents.

RAG was formalized by Lewis et al. (2020) in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", which reported that retrieval-augmented models outperformed parametric-only baselines on open-domain question answering.[1]

In the LLM era, RAG is widely deployed to make conversational AI systems more accurate, easier to update, and traceable to their sources. It is a dominant architecture for source-attributed AI answers.

How it works

A standard RAG pipeline:

Index: a document corpus is chunked, embedded into vectors, and stored in a vector store.
Retrieve: at query time, the query is embedded and nearest-neighbor search returns the top-K chunks.
Generate: the LLM receives [system prompt] + [retrieved chunks] + [query] and generates a response.
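
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the bag-of-words `embed` function stands in for a neural embedding model, the in-memory list stands in for a vector store, the search is brute force, and the function names (`chunk`, `build_index`, `retrieve`, `build_prompt`) are hypothetical. The final LLM call is omitted; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Split a document into fixed-size word windows (no overlap, for simplicity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding (assumption): sparse word-count vector instead of a neural encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Index step: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(index, query, k=2):
    """Retrieve step: brute-force nearest-neighbor search, returning the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(system_prompt, chunks, query):
    """Generate step, input side: [system prompt] + [retrieved chunks] + [query]."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system_prompt}\n\nSources:\n{context}\n\nQuestion: {query}"

docs = [
    "BM25 is a sparse ranking function based on term frequency.",
    "Dense retrieval embeds queries and passages into a shared vector space.",
    "The context window limits how many tokens a model can read at once.",
]
query = "How does dense retrieval embed passages?"
index = build_index(docs)
top_chunks = retrieve(index, query, k=2)
prompt = build_prompt("Answer using only the sources.", top_chunks, query)
print(prompt)
```

Numbering the chunks in the prompt is one way to let the generated answer cite its sources by index, which supports the traceability that motivates RAG.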

Variations include:

Sparse retrieval (BM25-based) vs dense retrieval (embedding-based) vs hybrid.
Reranking — a second-pass model reorders retrieved chunks by relevance before generation.
Iterative/agentic RAG — the model issues multiple retrieval calls, refining based on intermediate output.
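
To make the sparse and hybrid variants concrete, the sketch below scores documents with Okapi BM25 and then fuses the sparse ranking with a dense ranking. Several things here are assumptions rather than part of the description above: the IDF uses the smoothed form `log(1 + (N - df + 0.5) / (df + 0.5))`, the parameters `k1=1.5` and `b=0.75` are common defaults, the fusion method is reciprocal rank fusion (one of several ways to build a hybrid retriever), and `dense_ranking` is a hard-coded, hypothetical output standing in for an embedding-based retriever.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def ranking(scores):
    """Document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings: each document accumulates 1 / (k + rank) from every ranking."""
    fused = Counter()
    for r in rankings:
        for rank, doc_id in enumerate(r, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "a cat and a small kitten nap together",
]
sparse = bm25_scores("cat", docs)
sparse_ranking = ranking(sparse)
dense_ranking = [2, 1, 0]  # hypothetical output of an embedding-based retriever
hybrid_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(sparse_ranking, hybrid_ranking)
```

Note that BM25 gives the second document a score of zero because "cats" does not match "cat" exactly. This kind of lexical mismatch is one reason sparse and dense retrievers are often combined.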

Distinction from related terms

Term	Relationship to RAG
Grounding	The goal RAG achieves; grounding is the outcome, RAG is the architecture
Document grounding	A subtype of grounding, often implemented with RAG
Fine-tuning	Encodes knowledge into model weights; RAG retrieves at inference time
Context window	RAG places retrieved content inside the context window
Hallucination	RAG reduces factuality hallucination but can introduce faithfulness hallucination

RAG is a specific retrieval-then-generate pattern. Grounding is the broader goal of anchoring outputs in sources; RAG is the most common architecture achieving it, but not the only one.

Evaluation

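RAG systems are evaluated along three separately measurable dimensions:

Retrieval: precision and recall at K, that is, what fraction of the top-K retrieved chunks are relevant, and what fraction of all relevant chunks appear in the top K.
Generation (faithfulness): whether the answer's claims are supported by the retrieved passages, measured by faithfulness metrics such as AlignScore or the faithfulness score in Ragas.
Answer relevance: whether the answer addresses the query, regardless of source.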
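
The retrieval metrics can be computed directly once each query has a set of chunks judged relevant. The sketch below is a minimal version that assumes binary relevance labels and divides precision by K even when fewer than K results are returned (conventions differ on this point). The document IDs are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled with relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output (illustrative IDs)
relevant = {"d1", "d2", "d9"}          # ground-truth relevant chunks for the query

p3 = precision_at_k(retrieved, relevant, 3)  # 1 hit in top 3  -> 1/3
r3 = recall_at_k(retrieved, relevant, 3)     # 1 of 3 relevant -> 1/3
p4 = precision_at_k(retrieved, relevant, 4)  # 2 hits in top 4 -> 0.5
r4 = recall_at_k(retrieved, relevant, 4)     # 2 of 3 relevant -> 2/3
print(p3, r3, p4, r4)
```

In this example "d9" is never retrieved, so recall cannot reach 1.0 at any K. A generator given these chunks cannot ground a claim that depends on "d9", which is the kind of failure that is only visible when retrieval is scored separately.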

Evaluating only the final answer without decomposing retrieval and generation quality makes diagnosis of failure modes difficult.

References

[1] Lewis, Patrick et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. https://arxiv.org/abs/2005.11401
