Prompt caching

From llmref.wiki
Prompt caching: a mechanism that reuses the precomputed attention key-value tensors (KV-cache) of static prompt prefixes across multiple requests to reduce latency and computational cost.

Overview

Prompt caching is an optimization technique that stores and reuses the key-value (KV) cache of unchanging portions of a prompt across inference calls. Rather than recomputing the attention keys and values for an identical prompt prefix on each request, the inference engine retrieves the cached tensors from memory, reducing both computational overhead and wall-clock latency.

The technique addresses a practical inefficiency in large language model deployment: when users or systems make multiple requests with the same system instructions, context blocks, or knowledge bases, the model would conventionally reprocess these identical tokens from scratch each time. Prompt caching intercepts this redundancy at the transformer architecture level by storing the KV-cache—the pre-computed key and value matrices produced during the first forward pass of the static prefix.

This approach is distinct from traditional caching of complete model outputs. Rather than storing final text responses, prompt caching operates on intermediate representations within the model's computational graph, enabling chained prompts and dynamic completions while reusing expensive prefix computations. The technique has become particularly relevant as context windows expand and multi-turn agent workloads become common.

How it works

Prompt caching operates in three stages:

Token-level prefix identification: The system identifies static portions of the prompt that will remain constant across multiple invocations. These are typically system instructions, retrieval-augmented generation (RAG) context blocks, or shared knowledge bases. Boundaries between cacheable and dynamic content must be explicitly marked or inferred by the inference engine.

KV-cache generation and storage: On the first request containing a cacheable prefix, the model performs standard forward propagation through the attention mechanism for those tokens. The resulting key and value matrices (the KV-cache) are retained in GPU or system memory with an associated hash or token-sequence identifier. The cache occupies significant memory (roughly proportional to prefix token count × hidden dimension × number of layers, with one key tensor and one value tensor per layer), but it only has to be computed once.
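The memory relationship above can be made concrete with a rough estimate. The sketch below assumes standard multi-head attention, where hidden dimension = number of KV heads × head dimension, and 2-byte (16-bit) values; the function name and the model shape are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(prefix_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Rough KV-cache size for a cached prefix (illustrative estimate only)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * prefix_tokens * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape: 32 layers, 32 KV heads of dimension 128.
size = kv_cache_bytes(prefix_tokens=10_000, num_layers=32,
                      num_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # 4.88 GiB
```

Models that share key and value heads across query heads would have a smaller num_kv_heads and a correspondingly smaller cache.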

Cache retrieval and continuation: On subsequent requests sharing the same prefix, the inference engine skips the forward pass for the cached tokens. Instead, it loads the pre-computed KV-cache and begins forward propagation from the point where the cache ends, processing only the new (dynamic) tokens. Each new token attends to the cached keys and values, so the model generates continuations as though it had processed the entire prompt.
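Why resuming from a cached prefix gives the same result as a full recompute can be checked numerically. The following is a minimal single-layer, single-head sketch in NumPy with random weights; the names (causal_attention, Wq, Wk, Wv) are illustrative and not taken from any particular inference engine. Under causal masking, the keys and values of prefix tokens do not depend on later tokens, which is the property a multi-layer model relies on as well.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))


def causal_attention(q, k, v, offset):
    """Attention for queries at positions offset.. over keys at positions 0.."""
    scores = q @ k.T / np.sqrt(d)
    n_new, n_total = scores.shape
    q_pos = offset + np.arange(n_new)[:, None]
    k_pos = np.arange(n_total)[None, :]
    # Causal mask: a query may only attend to keys at or before its position.
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v


x = rng.standard_normal((12, d))   # 12 token vectors
prefix, suffix = x[:9], x[9:]      # 9 static tokens, 3 dynamic tokens

# Baseline: recompute everything from scratch.
full = causal_attention(x @ Wq, x @ Wk, x @ Wv, offset=0)

# First request: compute and store the prefix keys and values.
cached_k, cached_v = prefix @ Wk, prefix @ Wv

# Later request: project only the new tokens and append to the cache.
k = np.concatenate([cached_k, suffix @ Wk])
v = np.concatenate([cached_v, suffix @ Wv])
resumed = causal_attention(suffix @ Wq, k, v, offset=len(prefix))

print(np.allclose(full[9:], resumed))
```

The outputs for the three new tokens match the full recompute, while the query, key, and value projections for the nine prefix tokens are skipped on the second request.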

The cache is typically invalidated if the prefix changes. Many implementations manage it at a coarse token granularity (caching only at 1024-token boundaries, for instance), so a prefix is reusable up to the last boundary it shares with an earlier request. Some implementations use hierarchical caching strategies that cache multiple prefix lengths to balance memory overhead against reuse frequency.
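Boundary-based caching and multiple cached prefix lengths can be sketched with a small lookup structure. This is an assumption-laden illustration, not a real engine's API: PrefixCache, put, and longest_hit are hypothetical names, the stored payload is a placeholder string rather than real tensors, and a block size of 4 stands in for the 1024-token boundary so the example stays readable.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache keyed by a hash of block-aligned token prefixes."""

    def __init__(self, block_size=1024):
        self.block_size = block_size
        self.store = {}  # prefix hash -> opaque KV payload

    def _key(self, tokens):
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def put(self, tokens, kv_for_prefix):
        # Store one entry per full block boundary (several prefix lengths).
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.store[self._key(tokens[:end])] = kv_for_prefix(end)

    def longest_hit(self, tokens):
        # Try the longest block-aligned prefix first, then shorter ones.
        usable = len(tokens) - len(tokens) % self.block_size
        for end in range(usable, 0, -self.block_size):
            key = self._key(tokens[:end])
            if key in self.store:
                return end, self.store[key]
        return 0, None  # cache miss: the whole prompt must be processed


cache = PrefixCache(block_size=4)
system = list(range(10))  # 10 static token ids
# First request: boundaries at 4, 8, and 12 tokens are cached.
cache.put(system + [100, 101], lambda end: f"kv[0:{end}]")

hit, payload = cache.longest_hit(system + [200, 201, 202])
print(hit, payload)  # 8 kv[0:8]

miss, _ = cache.longest_hit([99] + system)  # changed first token
print(miss)  # 0
```

Only 8 of the 10 static tokens are reused in the second request because reuse stops at the last shared block boundary, and changing a single early token invalidates every cached prefix.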

Distinction from related terms

Term | Distinction
Prompt chaining | Prompt chaining sequences multiple separate API calls with output from one feeding into the next; prompt caching optimizes repeated calls to the same prefix by reusing intermediate representations within a single model instance.
In-context learning | In-context learning refers to the model's ability to adapt behavior based on examples or instructions provided in the prompt; prompt caching is an infrastructure optimization that reuses the computed representations of in-context examples without changing the learning mechanism.
Context window management | Context window refers to the maximum length of input a model can process; prompt caching addresses latency and cost of processing static content within a context window, not the window size itself.
Retrieval-augmented generation (RAG) | RAG dynamically retrieves relevant documents to include in each prompt; prompt caching can optimize RAG by caching the static retrieval prefix, but RAG concerns content selection while prompt caching concerns computation reuse.
Model output caching | Output caching stores final generated text responses; prompt caching stores intermediate KV representations, enabling regeneration with different continuations while reusing prefix computation.

Examples

Multi-turn customer support systems: A stateful agent maintaining a support conversation reuses a cached prompt containing company policy documents and customer relationship data. The first message is processed in full (cache miss); subsequent messages in the same conversation skip recomputation of the policy prefix, reducing per-message latency by 40–60% depending on prefix size.

Batch retrieval-augmented generation: A RAG pipeline caches the combined text of retrieved documents (e.g., 10,000 tokens) after an embedding model performs contextual retrieval. When multiple queries are answered over the same corpus in a session, the cached document prefix eliminates redundant transformer computation, and the time saved per query grows with the size of the static content.
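The saving in this scenario can be illustrated by counting how many input tokens the model has to process. The sketch below uses the 10,000-token prefix from the example; the five 50-token queries are a hypothetical workload, and the function name is illustrative. It counts prompt tokens only, so it says nothing about the time spent generating output tokens.

```python
def prefill_tokens(prefix_len, query_lens, cached):
    """Total prompt tokens processed across a session (illustrative count)."""
    if not cached:
        # Every request reprocesses the full prefix plus its own query.
        return sum(prefix_len + q for q in query_lens)
    # The prefix is processed once (cache miss on the first request);
    # every request then processes only its own query tokens.
    return prefix_len + sum(query_lens)


queries = [50] * 5  # hypothetical: five queries of 50 tokens each
without_cache = prefill_tokens(10_000, queries, cached=False)
with_cache = prefill_tokens(10_000, queries, cached=True)
print(without_cache, with_cache)  # 50250 10250
```

Under these assumptions the session processes roughly one fifth of the prompt tokens, and the gap widens as more queries share the same prefix.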

Constitutional AI compliance checking: A system using Constitutional AI caches the set of constitution rules and guardrail instructions that are identical across thousands of model calls. By reusing the KV-cache of these static governance tokens, each request processes only the user query and response candidate, so the cost of computing the prefix is paid once rather than on every call.
