Context window
Overview
The context window (also called context length or token limit) is the maximum number of tokens a language model can process in a single inference call. Everything passed to the model (the system prompt, conversation history, retrieved documents, and tool results) plus the model's own generated output must fit within this limit. Input that would exceed the window has to be truncated, summarized, or dropped before the call; many APIs reject an over-limit request outright rather than truncating it silently.
Context window size is set by the model's architecture and training (in particular, the range of positions its positional encoding was trained to handle) and determines how much information the model can reason about at once in a single call.
As of 2024–2025, common context windows range from 8K to over 1M tokens. As a rule of thumb, one token corresponds to roughly 0.75 words of English text.
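The 0.75 words-per-token rule of thumb can be turned into a quick size estimate. The sketch below is only an approximation for English prose: the function names are hypothetical, and real counts depend on the specific model's tokenizer, so a production system would call that tokenizer instead.

```python
import math

# Rule of thumb from the text: 1 token is roughly 0.75 English words.
# This is an approximation, not a property of any particular tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of English text from its word count."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Roughly estimate how many English words fit in a given number of tokens."""
    return int(token_count * WORDS_PER_TOKEN)


if __name__ == "__main__":
    print(estimate_tokens("the quick brown fox jumps over"))  # 6 words
    print(estimate_words(8_000))  # approximate word capacity of an 8K window
```

Under this estimate an 8K-token window holds about 6,000 English words; code, non-English text, and unusual formatting usually tokenize less efficiently.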
What counts toward the window
| Component | Counts toward window |
|---|---|
| System prompt | Yes |
| Conversation history (all turns) | Yes |
| Retrieved documents (RAG chunks) | Yes |
| Tool call inputs and outputs | Yes |
| Model's response (output tokens) | Yes (output tokens may have a separate sub-limit) |
| Model weights / parametric knowledge | No — not token-based |
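The accounting in the table can be sketched as a simple budget check. Everything here is illustrative: the `ContextBudget` class, the component names, and the token counts are assumptions, and the counts would normally come from the model's tokenizer. The sketch reserves room for the response up front, since output tokens share the same window.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical helper: total window plus tokens reserved for the response."""
    window: int      # total context window, in tokens
    max_output: int  # tokens reserved for the model's generated output

    def input_budget(self) -> int:
        """Tokens left for all input components once output is reserved."""
        return self.window - self.max_output


def check_fit(budget: ContextBudget, components: dict[str, int]) -> tuple[bool, int]:
    """Return (fits, remaining), where remaining is negative if over budget."""
    used = sum(components.values())
    remaining = budget.input_budget() - used
    return remaining >= 0, remaining


if __name__ == "__main__":
    budget = ContextBudget(window=8_000, max_output=1_000)
    components = {
        "system_prompt": 500,
        "conversation_history": 3_000,
        "retrieved_documents": 2_500,
        "tool_results": 800,
    }
    fits, remaining = check_fit(budget, components)
    print(fits, remaining)
```

With these example numbers the input uses 6,800 of the 7,000 tokens available, leaving 200 tokens of headroom.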
Distinction from agent memory
The context window is not the same as agent memory:
- The context window is temporary: it exists only for the duration of a single API call, and nothing in it persists unless explicitly stored elsewhere.
- Agent memory is a design pattern where important information is extracted from the context window and written to a persistent store (database, file, embeddings index) so it can be retrieved in future calls.
- Increasing the context window does not create memory: a 1M-token window still holds nothing between sessions unless the application re-injects prior context.
A longer context window reduces how often explicit memory management is needed but does not eliminate it, especially for long-running agents or multi-session applications.
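A minimal sketch of the memory pattern described above, assuming a JSON file as the persistent store. The `FileMemory` class and `build_prompt` function are hypothetical names used for illustration; a real agent might use a database or an embeddings index instead. The point is that facts survive only because the application writes them out and re-injects them into the next call's context.

```python
import json
from pathlib import Path


class FileMemory:
    """Hypothetical persistent store: a JSON list of facts on disk."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self) -> list[str]:
        """Read all stored facts; an absent file means no memory yet."""
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, fact: str) -> None:
        """Persist a fact extracted from the current context, skipping duplicates."""
        facts = self.load()
        if fact not in facts:
            facts.append(fact)
        self.path.write_text(json.dumps(facts), encoding="utf-8")


def build_prompt(memory: FileMemory, user_message: str) -> str:
    """Re-inject stored facts ahead of the new message for the next call."""
    facts = memory.load()
    if not facts:
        return user_message
    header = "Known facts from earlier sessions:"
    bullet_lines = [f"- {fact}" for fact in facts]
    return "\n".join([header, *bullet_lines, "", user_message])
```

A second `FileMemory` object pointed at the same path, standing in for a later session, sees the facts written by the first. The re-injected facts still count toward that call's context window.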
Context window and retrieval-augmented generation
Retrieval-augmented generation (RAG) pipelines populate the context window with retrieved document chunks. The context window sets an upper bound on how many retrieved chunks can be considered at once. Larger windows allow more retrieved content but also introduce lost-in-the-middle degradation: empirically, models use information in the middle of a very long context less reliably than information near its beginning or end.[1]
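The upper bound on retrieved chunks can be sketched as greedy packing under a token budget. The function names and the word-count token estimate below are assumptions for illustration. The second function shows one mitigation sometimes applied to lost-in-the-middle degradation, placing the strongest chunks at the edges of the context; it is offered as an option, not as a method prescribed by the cited paper.

```python
from typing import Callable


def pack_chunks(
    ranked_chunks: list[str],
    token_budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Greedily keep chunks (ordered best-first) while they fit in the budget."""
    packed: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # too large for the remaining space; a later, smaller chunk may still fit
        packed.append(chunk)
        used += cost
    return packed


def order_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Reorder best-first chunks so the strongest sit at the start and end,
    leaving the weakest in the middle of the context."""
    front = ranked_chunks[0::2]       # ranks 1, 3, 5, ...
    back = ranked_chunks[1::2][::-1]  # ranks ..., 6, 4, 2
    return front + back


if __name__ == "__main__":
    # Assumption: whitespace word count stands in for a real tokenizer.
    def word_count(text: str) -> int:
        return len(text.split())

    chunks = ["a b c", "d e f g h", "i j"]
    print(pack_chunks(chunks, token_budget=5, count_tokens=word_count))
    print(order_for_edges(["r1", "r2", "r3", "r4", "r5"]))
```

In the example, the five-word chunk is skipped because it would exceed the budget, while the later two-word chunk still fits. The reordering places rank 1 first, rank 2 last, and rank 5 in the center.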
See also
- Agent memory vs Context window
- Retrieval-augmented generation
- Tokenization
- System prompt
- Fundamentals
References
- [1] Liu, Nelson F., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. https://arxiv.org/abs/2307.03172