Latency vs throughput (LLM)

From llmref.wiki
Latency vs throughput (LLM) — Trade-off between response initiation speed (latency) and sustained output rate (throughput) in LLM inference.

Overview

Latency and throughput are two fundamental performance dimensions in LLM systems that often exist in tension. **Latency** measures the time elapsed from request submission to receipt of the first output token (time-to-first-token, TTFT), while **throughput** measures the total number of tokens generated per unit time across all concurrent requests. These metrics reflect different optimization priorities: latency optimizes for user-perceived responsiveness, while throughput optimizes for aggregate system capacity and cost efficiency.

The latency–throughput trade-off emerges from hardware constraints and inference architecture decisions. A system configured to minimize latency typically processes requests with minimal batching, keeping GPU memory available for fast response at the cost of underutilizing hardware. Conversely, throughput-optimized systems batch multiple requests together, achieving higher token generation rates per second but increasing latency for individual users as they wait for batch formation.

This trade-off is not merely theoretical; it determines whether an LLM deployment is tuned for interactive use cases (chatbots, real-time code completion) or for batch processing (content generation, report synthesis). Organizations must choose configurations based on workload characteristics, user expectations, and cost constraints.

How it is measured

  • **Latency** is measured as time-to-first-token (TTFT) in milliseconds or seconds, calculated as the interval from request submission to the arrival of the first output token at the client. This metric includes tokenization overhead, the model's forward pass over the prompt, and network transit time.
  • **Throughput** is measured as tokens per second (tokens/sec) aggregated across all active requests over a measurement window. It is calculated as:

<math>\text{Throughput} = \frac{\text{Total tokens generated}}{\text{Time interval (seconds)}}</math>

In practice, two related metrics also matter:

  • **Time-per-output-token (TPOT)**: latency between consecutive output tokens, commonly in the range of tens to low hundreds of milliseconds per token, depending on model size, hardware, and batch size.
  • **Batch size**: the number of concurrent requests processed together; larger batches raise aggregate throughput and lower cost per token, but increase per-request latency, both through time spent waiting for batch formation and through slower individual decode steps.

Measurement of each metric requires specification of hardware (GPU type, memory), model size, input sequence length, and output sequence length, as these all affect observed performance.
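
As a minimal sketch of how these metrics can be derived from raw timestamps, the Python snippet below computes TTFT, mean TPOT, and aggregate throughput. The `RequestTrace` record and the function names are hypothetical, and all timestamps are assumed to be seconds measured on a single clock at the client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all times are seconds on one clock."""
    submitted_at: float
    token_times: List[float]  # arrival time of each output token, ascending


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: submission to arrival of the first output token."""
    return trace.token_times[0] - trace.submitted_at


def mean_tpot(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens (needs at least 2 tokens)."""
    gaps = len(trace.token_times) - 1
    return (trace.token_times[-1] - trace.token_times[0]) / gaps


def throughput(traces: List[RequestTrace], window_start: float, window_end: float) -> float:
    """Tokens that arrived inside the window, across all requests, per second."""
    total = sum(
        1
        for trace in traces
        for ts in trace.token_times
        if window_start <= ts < window_end
    )
    return total / (window_end - window_start)


# Illustrative data: two overlapping requests with four output tokens each.
traces = [
    RequestTrace(submitted_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25]),
    RequestTrace(submitted_at=0.25, token_times=[1.0, 1.25, 1.5, 1.75]),
]

print("TTFT per request:", [ttft(t) for t in traces])
print("Mean TPOT (first request):", mean_tpot(traces[0]))
print("Throughput over [0, 2):", throughput(traces, 0.0, 2.0))
```

Note that TTFT and TPOT are per-request quantities, while throughput is computed over all requests in the window, which is why the two can move in opposite directions.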

Distinction from related terms

| Term | Distinction |
|---|---|
| Context window | Context window defines the maximum input length an LLM accepts; latency and throughput both scale with context length, but context window is a static model property, while latency/throughput are runtime measurements. |
| Prompt caching | Prompt caching reduces latency and improves throughput for repeated requests by storing intermediate computation; it is a technique to improve both metrics rather than a trade-off itself. |
| LLM Optimization | LLM optimization encompasses all strategies (quantization, distillation, batching) that improve latency, throughput, or both; latency–throughput trade-off describes the inherent constraint that simultaneous optimization of both is limited. |
| Inference infrastructure | Inference infrastructure is the hardware and software layer (GPUs, serving frameworks) on which latency and throughput are realized; the trade-off emerges from how that infrastructure is configured. |
| Benchmark contamination | Benchmark contamination concerns whether models have seen test data; latency–throughput metrics are measured independently of benchmark validity. |

Examples

  • **Example 1: Interactive chatbot vs. batch content generation.**

A chatbot serving real-time user queries targets sub-500 ms TTFT, so it uses small batch sizes (1–4 requests) and leaves GPU capacity idle so that new requests can start immediately. Its aggregate throughput may be only 100 tokens/sec. A content generation service that processes 100,000 document summaries overnight batches 64–128 requests together, achieving 5,000 tokens/sec throughput, but individual requests experience 5–10 seconds of latency while awaiting batch formation. Both systems use the same model but optimize different metrics.
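
The shape of this trade-off can be illustrated with a toy model. The sketch below is not a benchmark: the arrival rate, the step-time constants, and the function name `batch_profile` are all assumptions chosen for illustration, and the server is assumed to wait until a full batch has queued before starting.

```python
def batch_profile(batch_size, arrival_rate_rps, step_base_s=0.02, step_per_seq_s=0.001):
    """Toy model of static batching; every constant is an illustrative assumption.

    - Requests arrive evenly at `arrival_rate_rps` requests per second.
    - The server waits until `batch_size` requests are queued.
    - One decode step yields one token per request in the batch and takes
      step_base_s + step_per_seq_s * batch_size seconds.
    """
    # The first request to arrive waits for the other batch_size - 1 arrivals.
    worst_wait_s = (batch_size - 1) / arrival_rate_rps
    step_s = step_base_s + step_per_seq_s * batch_size
    tokens_per_s = batch_size / step_s
    return worst_wait_s, step_s, tokens_per_s


for b in (1, 4, 64):
    wait, step, tps = batch_profile(b, arrival_rate_rps=10.0)
    print(f"batch={b:3d}  wait={wait:5.2f} s  step={step * 1000:5.1f} ms  "
          f"throughput={tps:7.1f} tok/s")
```

Under these assumptions, larger batches produce more tokens per second in aggregate, while both the batch-formation wait and the time per decode step grow for each individual request.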

  • **Example 2: vLLM's PagedAttention and batching.**

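The vLLM inference framework[1] uses PagedAttention to manage KV-cache memory in fixed-size blocks, which lets it keep more requests in a batch and improves throughput over naive batching. Batching still means that requests share each forward pass, so per-request latency is generally higher than with single-request execution. Users can tune the `max_num_batched_tokens` parameter, the scheduler's per-iteration token budget, to shift the balance between latency and throughput for their workload.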
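
A minimal sketch of how such tuning might be organized is shown below. It only builds keyword arguments and does not import vLLM; `max_num_batched_tokens` is the parameter named above and `max_num_seqs` is vLLM's cap on sequences scheduled per iteration. The helper `engine_kwargs`, the profile names, the numeric values, and the model name are assumptions for illustration. Which profile actually gives better TTFT, TPOT, or throughput depends on the vLLM version and scheduler settings and should be measured.

```python
# Two illustrative scheduler budgets. The numbers are assumptions, not
# recommendations; measure both against the target workload.
SMALL_BUDGET = {"max_num_batched_tokens": 2048, "max_num_seqs": 8}
LARGE_BUDGET = {"max_num_batched_tokens": 16384, "max_num_seqs": 256}


def engine_kwargs(model: str, large_budget: bool) -> dict:
    """Hypothetical helper: build keyword arguments for vllm.LLM(**kwargs)."""
    profile = LARGE_BUDGET if large_budget else SMALL_BUDGET
    return {"model": model, **profile}


# "facebook/opt-125m" is only a placeholder model name.
kwargs = engine_kwargs("facebook/opt-125m", large_budget=False)
print(kwargs)

# With vLLM installed, the arguments could be passed on as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```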

  • **Example 3: Token streaming vs. full-response delivery.**

A system that streams tokens to the user as they are generated can report low TTFT (first token arrives in 50–100 ms) even when the complete response takes 30 seconds to arrive; the server's aggregate throughput (1000+ tokens/sec across all requests) says little about either figure for a single user. A system that buffers the full response before transmission performs the same generation work, but the client sees no output until the response is complete, so its client-observed TTFT equals the full generation time. Reported throughput may also differ depending on the measurement window.
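
The difference between the two delivery modes can be sketched from a single token trace. The function name `client_observed` and the timestamps are hypothetical, and network transit time is ignored for simplicity.

```python
def client_observed(token_times, submitted_at=0.0, streamed=True):
    """Return (time to first visible output, time to complete response) in seconds.

    `token_times` holds the generation time of each output token. When the
    response is buffered, nothing is visible until the last token exists.
    """
    complete = token_times[-1] - submitted_at
    first_visible = (token_times[0] - submitted_at) if streamed else complete
    return first_visible, complete


# Hypothetical trace: first token at 0.5 s, then one token every 0.25 s.
token_times = [0.5 + 0.25 * i for i in range(9)]

print("streamed:", client_observed(token_times, streamed=True))
print("buffered:", client_observed(token_times, streamed=False))
```

Generation work and completion time are identical in both cases; only the moment at which the user first sees output changes.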

See also

  • Inference infrastructure — hardware and software systems where latency–throughput trade-offs are realized
  • Large language model — foundational concept; all LLMs exhibit latency–throughput properties
  • Prompt caching — technique to both reduce latency and increase throughput for repeated queries
  • LLM Optimization — strategies to improve latency, throughput, or balance between them
  • Reasoning model — reasoning-focused LLMs often accept higher latency in exchange for quality; throughput is secondary

References

  1. Kwon et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.