Context window

From llmref.wiki
Context window: the bounded sequence of tokens a model can process in a single inference call, spanning prompt, retrieved context, conversation history, and output.

Overview

The context window (also context length or token limit) is the maximum number of tokens a language model can process in a single inference call. Everything passed to the model (the system prompt, conversation history, retrieved documents, and tool results), together with the model's own generated output, must fit within this limit. Content that would exceed the window cannot be processed: it has to be truncated or left out before the call.
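One common way to handle overflow is to drop the oldest conversation turns first. The following is a minimal sketch of that idea, not a prescribed method: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, and a real application would use the tokenizer of the model being called.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: whitespace word count, NOT a real tokenizer.
    return len(text.split())


def trim_history(system_prompt: str, turns: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent turns that fit alongside the system prompt."""
    budget = max_tokens - count_tokens(system_prompt)
    kept: list[str] = []
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["first question here", "a long answer with many words in it", "follow up"]
print(trim_history("be brief", turns, max_tokens=12))
```

Dropping whole turns from the oldest end keeps the remaining history contiguous; summarizing old turns instead is another option that this sketch does not cover.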

Context window size is determined by the model's architecture and training, in particular the range of positions its positional encoding can represent and the sequence lengths it was trained on. It sets the maximum amount of information the model can reason about in one call.

As of 2024–2025, common context windows range from 8K to over 1M tokens. In English text, one token corresponds to roughly 0.75 words.
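The 0.75 words-per-token rule of thumb can be turned into a quick estimate. This is only a rough sketch: the function names are illustrative, the ratio applies to English prose, and actual counts depend on the specific tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text; varies by tokenizer


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from an English word count, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough number of English words that fit in a token budget."""
    return math.floor(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(1500))  # 2000
print(estimate_words(8192))   # 6144
```

By this estimate, an 8K window holds about 6,000 English words in total, shared between input and output.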

What counts toward the window

Component                            Counts toward window
System prompt                        Yes
Conversation history (all turns)     Yes
Retrieved documents (RAG chunks)     Yes
Tool call inputs and outputs         Yes
Model's response (output tokens)     Yes (output tokens may have a separate sub-limit)
Model weights / parametric knowledge No (not token-based)
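The accounting in the table can be sketched as a simple budget check. The class and field names below are hypothetical, and the token counts are made-up example values; the point is only that every listed component, plus the space reserved for output, is subtracted from the same window.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int             # total context window, in tokens
    max_output: int         # tokens reserved for the model's response
    system_prompt: int = 0
    history: int = 0
    retrieved: int = 0
    tool_io: int = 0

    def input_tokens(self) -> int:
        return self.system_prompt + self.history + self.retrieved + self.tool_io

    def remaining(self) -> int:
        """Tokens left after input and reserved output; negative means overflow."""
        return self.window - self.input_tokens() - self.max_output

    def fits(self) -> bool:
        return self.remaining() >= 0


# Example values are illustrative only.
budget = ContextBudget(window=8000, max_output=1000, system_prompt=500,
                       history=3000, retrieved=2500, tool_io=400)
print(budget.remaining())  # 600
print(budget.fits())       # True
```

Reserving output space up front avoids a request whose input fits but leaves no room for the response.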

Distinction from agent memory

The context window is not the same as agent memory:

  • The context window is temporary: it exists only for the duration of a single API call, and nothing in it persists unless it is explicitly stored elsewhere.
  • Agent memory is a design pattern where important information is extracted from the context window and written to a persistent store (database, file, embeddings index) so it can be retrieved in future calls.
  • Increasing the context window does not create memory: a 1M-token window still holds nothing between sessions unless the application re-injects prior context.

A longer context window reduces how often explicit memory management is needed but does not eliminate the need, especially for long-running agents or multi-session applications.
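The memory pattern described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `FileMemory` is a hypothetical store backed by a single JSON file, and `build_prompt` shows one possible way to re-inject stored facts into a later call. A production system might use a database or an embeddings index instead.

```python
import json
from pathlib import Path


class FileMemory:
    """Hypothetical persistent store: one JSON file holding remembered facts."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self) -> list[str]:
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, fact: str) -> None:
        facts = self.load()
        if fact not in facts:  # skip exact duplicates
            facts.append(fact)
        self.path.write_text(json.dumps(facts), encoding="utf-8")


def build_prompt(memory: FileMemory, user_message: str) -> str:
    """Re-inject stored facts into the context for a new call."""
    facts = "\n".join(f"- {fact}" for fact in memory.load())
    return f"Known facts:\n{facts}\n\nUser: {user_message}"
```

In one session the application would call `remember(...)` on facts extracted from the conversation; in a later session it would call `build_prompt(...)` so those facts occupy tokens in the new context window. The model itself retains nothing between the two calls.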

Context window and retrieval-augmented generation

Retrieval-augmented generation (RAG) pipelines populate the context window with retrieved document chunks, so the window sets an upper bound on how many chunks the model can consider at once. Larger windows admit more retrieved content but also expose lost-in-the-middle degradation: empirically, models use information placed in the middle of a long context less reliably than information near the beginning or end.[1]
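Both constraints can be sketched in a few lines. The first function greedily packs the highest-scoring chunks into a token budget; the second applies one possible mitigation heuristic for lost-in-the-middle, placing the most relevant chunks at the edges of the context. The heuristic is an assumption for illustration rather than a method taken from the cited paper, the function names are hypothetical, and `count_tokens` is a word-count stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: whitespace word count, NOT a real tokenizer.
    return len(text.split())


def pack_chunks(chunks: list[tuple[str, float]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring (text, score) chunks that fit the budget."""
    kept: list[str] = []
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = count_tokens(text)
        if cost <= budget:
            kept.append(text)
            budget -= cost
    return kept  # ordered from most to least relevant


def edges_first(ranked: list[str]) -> list[str]:
    """Alternate ranked chunks between the front and the back of the context,
    so the least relevant chunks end up in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


chunks = [("alpha beta gamma", 0.9), ("delta", 0.2),
          ("one two three four", 0.8), ("x y", 0.5)]
ranked = pack_chunks(chunks, budget=6)
print(edges_first(ranked))
```

Whether reordering helps in practice depends on the model and the task, so it is worth measuring rather than assuming.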

References

  1. Liu, Nelson F. et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL 2024. https://arxiv.org/abs/2307.03172