Token budget
Overview
A token budget is a limit on the total number of tokens that an LLM interaction may consume during a single request, conversation session, or billing period. Token budgets serve as a primary cost-control mechanism in production LLM systems, where each token processed incurs measurable computational expense and inference latency.
Token budgets operate at multiple scope levels: per API call, per user session, per application, or across an entire organization. They differ fundamentally from context windows: a context window defines the maximum number of tokens a model can process at once, whereas a token budget limits cumulative token consumption over time. When a token budget is exhausted, the system typically truncates output, rejects further requests, or raises a quota-exceeded error.
The budget constraint creates a direct trade-off between response quality and cost. Systems must balance longer, more thorough RAG pipeline executions, multi-step chain-of-thought reasoning, or extended agent memory retention against strict token limits. This necessitates deliberate choices in prompt engineering, chunking strategies, and orchestration patterns to maximize utility within fixed token allocations.
How it works
Token budgets are implemented as counters that accumulate token consumption across three categories of usage in an LLM request:
- Input tokens: Tokens in the system prompt, user query, retrieved context, prompt templates, and any prior conversation history.
- Output tokens: Tokens generated in the model's response. Most pricing models charge output tokens at a higher per-token rate than input tokens.
- Overhead tokens: Tokens consumed by intermediate steps such as query rewriting, reranking, or evaluation in multi-step workflows.
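The three token categories listed above can be tracked with a single counter. The following is a minimal sketch, not a reference implementation: the `TokenBudget` class, the `BudgetExceeded` exception, and the category names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when recording usage would push the total past the limit."""


def _empty_usage() -> dict:
    # One counter per category described in the text.
    return {"input": 0, "output": 0, "overhead": 0}


@dataclass
class TokenBudget:
    """Hypothetical per-request or per-session budget counter."""

    limit: int
    used: dict = field(default_factory=_empty_usage)

    @property
    def total(self) -> int:
        return sum(self.used.values())

    @property
    def remaining(self) -> int:
        return self.limit - self.total

    def record(self, category: str, tokens: int) -> None:
        """Add usage to one category, refusing anything that exceeds the limit."""
        if category not in self.used:
            raise ValueError(f"unknown category: {category}")
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.total + tokens > self.limit:
            raise BudgetExceeded(
                f"{tokens} tokens requested, {self.remaining} remaining"
            )
        self.used[category] += tokens


budget = TokenBudget(limit=1000)
budget.record("input", 600)
budget.record("output", 300)
print(budget.remaining)
```

Rejecting the call before updating the counter keeps the recorded total at or below the limit, which matches the "rejects further requests" behavior; a system that prefers truncation would instead clamp `tokens` to `remaining`.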
When the cumulative total approaches or exceeds the budget threshold, systems implement graceful degradation strategies:
- Truncating retrieved documents to keep only the most relevant sections
- Reducing the number of reasoning steps in agentic workflows
- Switching to lower-token-cost models for certain subtasks
- Queuing requests to distribute tokens across time windows
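The first strategy in the list, truncating retrieved documents, can be sketched as follows. This assumes the documents are already sorted from most to least relevant and, as a loud simplification, treats whitespace-separated words as tokens; a real system would count with the model's own tokenizer. The function name `fit_documents` is hypothetical.

```python
def approx_token_count(text: str) -> int:
    # Assumption: one whitespace-separated word stands in for one token.
    return len(text.split())


def fit_documents(ranked_docs: list[str], token_allowance: int) -> list[str]:
    """Keep documents in relevance order until the allowance is used up.

    The last document that does not fit entirely is cut to the words that
    still fit; everything after it is dropped.
    """
    kept = []
    remaining = token_allowance
    for doc in ranked_docs:
        if remaining <= 0:
            break
        cost = approx_token_count(doc)
        if cost <= remaining:
            kept.append(doc)
            remaining -= cost
        else:
            kept.append(" ".join(doc.split()[:remaining]))
            remaining = 0
    return kept


docs = ["refund policy overview", "shipping times by region and carrier", "legal notice"]
print(fit_documents(docs, token_allowance=5))
```

Because the input is ranked, cutting from the tail discards the least relevant material first, which is the intent of the degradation strategy.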
Token budgets are typically configured via API parameters (e.g., `max_tokens`, `max_completion_tokens`), which cap output tokens for a single request, or via organizational policies enforced at the inference infrastructure level. Providers such as OpenAI, Anthropic, and Google expose per-request output caps; orchestration layers, including those built on the Model Context Protocol, can allocate tokens at a finer grain across tool calls and multi-agent orchestration.
| Term | Distinction |
|---|---|
| Context window | A context window is the maximum number of tokens the model architecture can process in a single forward pass; a token budget is a consumption limit across time or requests. A model with a 200K context window can be given a 1K token budget per API call. |
| Agent memory | Agent memory is cumulative state retained across multiple interactions; a token budget governs how many tokens can be consumed to retrieve, encode, or process that memory. Reading from a large long-term memory can use up much of a tight token budget. |
| Inference cost | Inference cost is the actual dollar amount charged; a token budget is a limit expressed in tokens. Billing multiplies tokens consumed by per-token rates; budgets enforce the constraint before billing occurs. |
| Rate limiting | Rate limiting caps requests per time unit (e.g., 10 requests/minute); token budget caps cumulative tokens per request or session regardless of frequency. |
| Output constraints | Output constraints specify the format or structure of responses; token budgets specify the size. Constraints are qualitative; budgets are quantitative. |
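The relationship between inference cost and a token budget described in the table can be made concrete with a small calculation. This is an illustrative sketch only: the function names are hypothetical and the per-million-token rates are made-up numbers, not any provider's actual pricing.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Cost in currency units: tokens consumed multiplied by per-token rates."""
    return (
        input_tokens * input_rate_per_million
        + output_tokens * output_rate_per_million
    ) / 1_000_000


def within_budget(input_tokens: int, output_tokens: int, budget_tokens: int) -> bool:
    """The budget check is expressed in tokens and runs before any billing."""
    return input_tokens + output_tokens <= budget_tokens


# Hypothetical rates: output tokens priced higher than input tokens.
if within_budget(input_tokens=2000, output_tokens=500, budget_tokens=3000):
    cost = estimate_cost(2000, 500, input_rate_per_million=1.0, output_rate_per_million=3.0)
    print(f"estimated cost: {cost:.4f}")
```

The budget check and the cost estimate are kept as separate functions to mirror the distinction in the table: one limits units consumed, the other converts units into money.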
Examples
- OpenAI API per-request budget: A developer calls the GPT-4 Turbo API with `max_tokens=500`, which places a hard cap on completion tokens and prevents runaway outputs. Input tokens are not capped by this parameter, but total token usage is logged and billed.[1]
- Anthropic Messages API with `max_tokens`: The Anthropic Messages API requires a `max_tokens` parameter (e.g., 1024 tokens) that applies only to output; input tokens are counted separately. (The legacy Text Completions API named this parameter `max_tokens_to_sample`.) Because input and output usage are reported for each request, an organization can meter requests against a per-user monthly token budget.
- Agentic workflow token budgeting: A multi-agent system performing customer service allocates 5000 tokens per request. A critic agent performing automated evaluation consumes 800 tokens; a RAG pipeline retrieves and reranks documents (1200 tokens); the primary agent responds (1000 tokens), leaving 2000 tokens for query rewriting or follow-up reasoning. If the budget is exceeded, the system truncates retrieved documents or disables multi-step ReAct loops.
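The arithmetic in the agentic workflow example can be checked with a short script. The figures come from the example itself; the dictionary keys and function names are hypothetical labels introduced for illustration.

```python
REQUEST_BUDGET = 5000

# Planned spend per step, taken from the example above.
planned = {
    "critic_evaluation": 800,
    "rag_retrieve_and_rerank": 1200,
    "primary_response": 1000,
}


def headroom(budget: int, spend: dict) -> int:
    """Tokens left after all planned steps are accounted for."""
    return budget - sum(spend.values())


def fallback_needed(budget: int, spend: dict, next_step_tokens: int) -> bool:
    """True when an extra step (e.g. query rewriting) would not fit."""
    return next_step_tokens > headroom(budget, spend)


print(headroom(REQUEST_BUDGET, planned))  # 2000
```

When `fallback_needed` returns true, the system would fall back to the measures named in the example, such as truncating retrieved documents or disabling multi-step ReAct loops.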
See also
- Context window — The architectural maximum tokens processable per forward pass
- Inference infrastructure — The systems that enforce token budgets and meter consumption
- Prompt engineering — Techniques to minimize token usage within budgets
- Retrieval-augmented generation — A major consumer of tokens; often requires budget optimization
- Multi-agent orchestration — Distributes token budgets across multiple agents and steps
References
- [1] OpenAI. "Tokens." API Documentation. https://platform.openai.com/docs/guides/tokens