Context window

Context window: the bounded sequence of tokens a model can process in a single inference call, spanning prompt, retrieved context, conversation history, and output.

Overview

The context window (also context length or token limit) is the maximum number of tokens a language model can process in a single inference call. Everything passed to the model (the system prompt, conversation history, retrieved documents, and tool results), together with the model's own generated output, must fit within this limit. Content that does not fit must be truncated or left out.

Context window size is set by the model's architecture and training (in particular, the range of positions its positional encoding supports) and determines how much information the model can reason over in one call.

Common context windows as of 2024–2025 range from 8K to over 1M tokens; in English text, 1 token corresponds to roughly 0.75 words.
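The words-per-token rule of thumb can be turned into a quick back-of-the-envelope estimator. The sketch below is only illustrative: the function names are hypothetical, the 0.75 ratio is the approximation quoted above for English text, and a real count requires the tokenizer of the specific model.

```python
import math

# Rule of thumb quoted above; an approximation for English prose only.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (hypothetical helper)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def estimate_words(tokens: int) -> int:
    """Rough number of English words that fit in a given token count."""
    return int(tokens * WORDS_PER_TOKEN)


print(estimate_tokens("one two three"))  # 4
print(estimate_words(8_000))             # 6000
print(estimate_words(1_000_000))         # 750000
```

Under this heuristic an 8K window holds about 6,000 English words and a 1M window about 750,000. Code, non-English text, and unusual formatting can deviate substantially from the ratio.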

What counts toward the window

Component	Counts toward window
System prompt	Yes
Conversation history (all turns)	Yes
Retrieved documents (RAG chunks)	Yes
Tool call inputs and outputs	Yes
Model's response (output tokens)	Yes (output tokens may have a separate sub-limit)
Model weights / parametric knowledge	No — not token-based
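The accounting in the table can be sketched as a simple budget check. This is a minimal illustration, not any provider's API: the class name, field names, and all token counts are made up, and the separate output sub-limit is modeled as a plain cap.

```python
from dataclasses import dataclass


@dataclass
class WindowBudget:
    """Hypothetical budget for one call; all values are token counts."""
    context_window: int      # total tokens the model accepts (input + output)
    max_output_tokens: int   # separate output sub-limit, if the model has one

    def remaining_for_output(self, input_tokens: int) -> int:
        """Tokens left for the response once every input component is counted."""
        room = self.context_window - input_tokens
        return max(0, min(room, self.max_output_tokens))


# Illustrative token counts for each component that counts toward the window.
components = {
    "system_prompt": 400,
    "conversation_history": 3_000,
    "retrieved_documents": 2_500,
    "tool_calls": 600,
}

budget = WindowBudget(context_window=8_000, max_output_tokens=4_000)
used = sum(components.values())
print(used)                               # 6500
print(budget.remaining_for_output(used))  # 1500
```

Here the input components leave 1,500 tokens for the response even though the output sub-limit alone would allow 4,000: whichever bound is tighter applies.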

Distinction from agent memory

The context window is not the same as agent memory:

The context window is temporary: the model retains nothing from it once an API call completes, and nothing in it persists unless explicitly stored elsewhere.
Agent memory is a design pattern where important information is extracted from the context window and written to a persistent store (database, file, embeddings index) so it can be retrieved in future calls.
Increasing the context window does not create memory: a 1M-token window still holds nothing between sessions unless the application re-injects prior context.

A longer context window reduces the frequency of needing explicit memory management but does not eliminate it, especially for long-running agents or multi-session applications.
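The memory pattern described above can be sketched with a small persistent store. The example below assumes a SQLite file as the store and uses hypothetical helper names (open_memory, remember, recall_as_prompt); a real agent might use a database service or an embeddings index instead.

```python
import os
import sqlite3
import tempfile


def open_memory(path: str) -> sqlite3.Connection:
    """Open (or create) a persistent key-value memory store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def remember(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Write a fact extracted from the current context window to the store."""
    conn.execute(
        "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)", (key, value)
    )
    conn.commit()


def recall_as_prompt(conn: sqlite3.Connection) -> str:
    """Render stored facts as text to re-inject into a later call's context."""
    rows = conn.execute("SELECT key, value FROM memory ORDER BY key").fetchall()
    return "\n".join(f"{key}: {value}" for key, value in rows)


# Illustrative store location; a real application would use a stable path.
path = os.path.join(tempfile.mkdtemp(), "agent_memory.db")

# Session 1: something learned in the context window is written out.
conn = open_memory(path)
remember(conn, "user_name", "Ada")
conn.close()

# Session 2: the new context window starts empty, so the application
# re-injects what was stored.
conn = open_memory(path)
prior_context = recall_as_prompt(conn)
conn.close()
print(prior_context)  # user_name: Ada
```

The model itself plays no part in persistence here. The application decides what to store and what to place back into the window on the next call.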

Context window and retrieval-augmented generation

Retrieval-augmented generation (RAG) pipelines populate the context window with retrieved document chunks. The context window sets an upper bound on how many retrieved chunks can be considered at once. Larger windows allow more retrieved content but also introduce lost-in-the-middle degradation: empirically, models use content in the middle of very long contexts less reliably than content near the beginning or end.[1]
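The upper bound on retrieved chunks can be illustrated with a greedy packing step. This is one simple strategy among many, shown under stated assumptions: the Chunk type and pack_chunks function are hypothetical, relevance scores come from some retriever, and token counts are precomputed.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float   # retriever relevance, higher is better (assumed)
    tokens: int    # precomputed token count for this chunk


def pack_chunks(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens."""
    selected: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


chunks = [
    Chunk("a", 0.9, 500),
    Chunk("b", 0.8, 700),
    Chunk("c", 0.4, 300),
    Chunk("d", 0.7, 200),
]

packed = pack_chunks(chunks, budget=1_000)
print([c.text for c in packed])       # ['a', 'd', 'c']
print(sum(c.tokens for c in packed))  # 1000
```

Chunk "b" is skipped because it would exceed the budget after "a", while the smaller "d" and "c" still fit. The budget passed in would normally be what remains of the window after the system prompt, history, and reserved output tokens.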

References

[1] Liu, Nelson F. et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL 2024. https://arxiv.org/abs/2307.03172
