Streaming output
Overview
Streaming output is a delivery mechanism for large language model (LLM) responses in which individual tokens are transmitted to the client as they are generated by the model, rather than buffering the entire completion and sending it after inference completes. This approach reduces perceived latency and enables real-time interaction patterns, particularly in conversational interfaces.
In traditional batch processing, a model generates all tokens sequentially in memory, then returns the complete response. Streaming decouples token generation from transmission, allowing consumers to begin processing output while the model continues inference. This is especially valuable in user-facing applications where waiting for full completion creates noticeable delays, and in applications requiring progressive refinement or early-stopping logic.
Streaming output interacts with multiple system dimensions: Inference infrastructure must support token-by-token emission; latency and throughput trade-offs shift in favor of perceived responsiveness over raw throughput; and downstream systems may implement Prompt caching or early termination strategies. The technique has become standard in consumer-facing LLM products and answer engines.
How it works
In a streaming architecture, the model's inference loop yields control at each token generation step. The token is serialized (often as JSON over HTTP Server-Sent Events, WebSocket, or other streaming protocols) and transmitted immediately. The client receives tokens in near-real-time and may render, buffer, or process them as they arrive.
Token emission typically follows these steps:
- Model generates token <math>t_i</math> from the prior distribution <math>p(t_i | t_1, \ldots, t_{i-1})</math>
- Token ID is converted to text representation
- Text fragment is wrapped in a message envelope (protocol-dependent)
- Message is flushed to the network buffer without waiting for subsequent tokens
- Client receives and processes message independently
- Model continues to <math>t_{i+1}</math> while client-side processing occurs in parallel
This requires client-side buffering logic to handle out-of-order or delayed arrivals, particularly in multi-token scenarios where word-piece tokenization may require accumulation before rendering. Error handling must address disconnections mid-stream, incomplete messages, and recovery semantics (retry vs. restart).
From an infrastructure perspective, streaming imposes constraints: models must be capable of yielding control (not all inference engines support this natively), and the serving system must maintain connection state per client. Batch inference becomes more complex when clients join and leave mid-generation, requiring careful scheduling to amortize token computation across heterogeneous request lifecycles.
| Term | Distinction |
|---|---|
| Batch inference | Batch inference groups multiple requests and processes them together for throughput efficiency; streaming output focuses on per-token transmission timing for latency reduction. A system may use batch inference at the model layer while streaming tokens to individual clients. |
| Latency vs throughput | Latency and throughput are performance dimensions; streaming output is a technique that typically reduces time-to-first-token (latency) while potentially reducing overall throughput by breaking batch parallelism. |
| Prompt caching | Prompt caching reduces recomputation of static context; streaming output reduces latency for dynamic token delivery. Both address responsiveness but at different points in the inference pipeline. |
| Context window | Context window is the maximum sequence length a model can process; streaming output is orthogonal and applies to any sequence length, determining how completion is *delivered* rather than what can be *processed*. |
Examples
OpenAI's ChatGPT API supports streaming via the `stream=true` parameter, emitting completion tokens as SSE (Server-Sent Events) formatted JSON objects. Clients receive incremental `"delta"` text fragments and can display them character-by-character without waiting for the full response.
Anthropic's Claude API similarly implements streaming, with tokens arriving in `content_block_delta` events. This is used in consumer interfaces (claude.ai) where users see text appear progressively as the model generates it.
Google's Gemini API offers streaming for both text and multimodal outputs. The gRPC-based protocol streams `content` messages containing token-level detail, allowing progressive rendering in mobile and web clients where network latency would otherwise cause noticeable blocking.
See also
- Inference infrastructure — systems and protocols enabling token-level output control
- Latency vs throughput (LLM) — performance trade-offs affected by streaming design
- Batch inference — alternative aggregation strategy with different latency characteristics
- Prompt caching — complementary technique for reducing inference time in other phases
- Large language model — foundational concept for understanding token generation mechanics