Speculative decoding

From llmref.wiki
Speculative decoding — A decoding strategy where a smaller draft model proposes token sequences verified in parallel by a larger model to reduce latency.

Overview

Speculative decoding is an inference optimization technique that accelerates large language model token generation by parallelizing the verification of proposed tokens. A smaller, faster "draft model" generates candidate tokens autoregressively, while a larger "verifier model" processes multiple candidates simultaneously, accepting or rejecting them in a single forward pass. This decouples the typically sequential nature of token-by-token generation, reducing wall-clock latency while maintaining output distribution equivalence to sampling directly from the larger model.

The approach addresses a fundamental bottleneck in LLM latency: autoregressive sampling requires one forward pass per token, making latency proportional to output length. By amortizing computation across multiple candidate tokens, speculative decoding reduces the number of verifier forward passes required without changing the probability distribution of generated text. The draft model must be substantially smaller (or use quantized weights) to remain faster than the verifier, making the trade-off economically viable.
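The claim that the output distribution is unchanged can be checked exactly for a single position. The sketch below assumes the standard speculative sampling rule: a draft token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the residual max(0, p − q), renormalized. The function name and the example distributions are illustrative, not taken from any particular library.

```python
from fractions import Fraction

def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one accept-or-resample step.

    p: verifier probabilities per token; q: draft probabilities per token.
    Both are lists of Fractions that each sum to 1.
    """
    # P(draft proposes x and x is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(px, qx) for px, qx in zip(p, q)]
    reject_mass = 1 - sum(accepted)

    # On rejection, the replacement comes from max(0, p - q), renormalized.
    residual = [max(Fraction(0), px - qx) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0:
        # p and q are identical, so nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]

# Hypothetical verifier (p) and draft (q) distributions over a 3-token vocabulary.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
q = [Fraction(1, 5), Fraction(1, 5), Fraction(3, 5)]

# The emitted token is distributed exactly as the verifier's p, whatever q is.
assert speculative_output_distribution(p, q) == p
```

Using exact fractions rather than floats makes the equality check literal: the draft distribution affects how often tokens are accepted, not what is ultimately emitted.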

Speculative decoding assumes access to a verifier model, typically a larger or higher-capacity variant than the draft model. It is most effective when the verifier can score several proposed tokens in one pass at little extra cost, which is usually the case at small batch sizes where decoding is limited by memory bandwidth rather than compute, and when the draft model agrees with the verifier often enough that multiple tokens are accepted per verification step.

How it works

The algorithm proceeds in stages:

1. Draft generation: The draft model autoregressively samples a sequence of k tokens (e.g., 4–8), one per forward pass, conditioned on the current context.

2. Parallel verification: The verifier model receives the context plus all k draft tokens and performs a single forward pass, producing next-token logits at k+1 positions: one for judging each draft token, plus one for the position after the last draft token.

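3. Token-by-token acceptance: Starting from the first draft token, each draft token x is accepted with probability min(1, p(x)/q(x)), where p(x) is the verifier's probability and q(x) is the draft model's probability for that token at that position. A token the verifier rates at least as likely as the draft model did is therefore always accepted. At the first rejection, a replacement token is sampled from the residual distribution proportional to max(0, p − q), and the remaining draft tokens are discarded. If all k draft tokens are accepted, one additional token is sampled from the verifier's distribution at position k+1.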

4. Iteration: The context window is updated with accepted (and possibly resampled) tokens, and the cycle repeats until an end-of-sequence token is generated or the output limit is reached.
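The four stages above can be sketched end to end with toy models. This is a minimal illustration, not a production implementation: `toy_model`, `VOCAB`, and the "favour the next token id" rule are hypothetical stand-ins for real networks, the acceptance rule is the standard min(1, p/q) test with residual resampling, and end-of-sequence handling is omitted. A real verifier would score all positions in one batched forward pass, whereas the toy verifier is simply called once per position.

```python
import random

VOCAB = 4  # toy vocabulary size (assumption for illustration)

def toy_model(sharpness):
    """Build a hypothetical next-token distribution function of the context."""
    def dist(context):
        # Toy rule: favour token (last + 1) mod VOCAB; all tokens keep nonzero mass.
        last = context[-1] if context else 0
        weights = [1.0] * VOCAB
        weights[(last + 1) % VOCAB] += sharpness
        total = sum(weights)
        return [w / total for w in weights]
    return dist

def sample(weights, rng):
    """Draw one token id with probability proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def speculative_step(context, draft, verifier, k, rng):
    """One draft-then-verify iteration; returns the tokens to append (1 to k+1)."""
    # 1. Draft k tokens autoregressively, remembering each draft distribution q.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Verifier distributions at all k+1 positions (one pass in a real model).
    p_dists = [verifier(list(context) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept or reject from left to right.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from max(0, p - q) and discard later drafts.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # All k accepted: take a bonus token from the verifier at position k+1.
    out.append(sample(p_dists[k], rng))
    return out

def generate(context, draft, verifier, k, max_new_tokens, seed=0):
    """4. Iterate until max_new_tokens have been produced."""
    rng = random.Random(seed)
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        out.extend(speculative_step(out, draft, verifier, k, rng))
    return out[: len(context) + max_new_tokens]

tokens = generate([0], toy_model(1.0), toy_model(4.0), k=4, max_new_tokens=20)
```

Each call to `speculative_step` emits at least one token, so the loop always makes progress even when the first draft token is rejected.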

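The verifier's forward pass cost is amortized because it processes all k draft positions in a single pass rather than k separate passes. Single-sequence decoding is typically limited by the time needed to read model weights from memory, so a pass over several positions costs little more than a pass over one. The draft model's multiple forward passes are cheaper due to its smaller size. The net latency reduction depends on the ratio of draft to verifier forward-pass cost and on the acceptance rate: higher acceptance rates mean more tokens per verifier invocation and therefore fewer invocations overall.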
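A rough estimate of this trade-off can be computed under two simplifying assumptions that are not stated in the text above: each draft token is accepted independently with the same probability `alpha`, and a verifier pass over k+1 positions takes as long as a pass over one. The function names and example numbers are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verifier pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha; the count includes the resampled or bonus token.
    """
    if alpha == 1.0:
        return k + 1
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, cost_ratio):
    """Estimated speedup over plain autoregressive decoding with the verifier.

    cost_ratio: draft forward-pass time divided by verifier forward-pass time.
    One iteration costs k draft passes plus one verifier pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1)

# Hypothetical setting: 80% acceptance, 4 draft tokens, draft 10x cheaper.
print(f"{expected_tokens_per_step(0.8, 4):.2f}")  # 3.36
print(f"{estimated_speedup(0.8, 4, 0.1):.2f}")    # 2.40
```

The estimate also shows why a larger k is not always better: the numerator saturates as `alpha ** (k + 1)` shrinks, while the draft cost in the denominator keeps growing linearly with k.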

Variants include batched speculative decoding (processing multiple sequences simultaneously with shared draft proposals) and hierarchical drafting (cascading multiple smaller models before the verifier).

Distinction from related terms

| Term | Distinction |
| --- | --- |
| Batch inference | Batch inference processes multiple independent sequences in parallel; speculative decoding parallelizes token candidates within a single sequence to reduce latency per token. |
| Mixture of Experts (MoE) | Mixture of Experts routes tokens to different expert subnetworks within a single model; speculative decoding uses two separate models (draft and verifier) to reduce forward pass count. |
| Prompt caching | Prompt caching reuses cached activations for repeated context; speculative decoding proposes and verifies multiple token candidates per inference iteration to reduce sequential latency. |
| In-context learning | In-context learning is a prompting technique that conditions model behavior on examples; speculative decoding is an inference-time optimization orthogonal to prompting strategy. |

Examples

  • Medusa (Cai et al., 2024): A speculative decoding approach that attaches multiple extra decoding heads to the same model backbone instead of using a separate draft model. The heads predict several future tokens in parallel, and the resulting candidates are verified against the base model's own next-token predictions. The authors report speedups of roughly 2–3× on open-weight models; the extra heads must be trained, although the backbone itself can remain frozen.
  • Google Gemini's assisted generation: Internal deployments are described as using a smaller draft model (e.g., a distilled or quantized variant) to propose tokens, which a full-scale verifier model accepts or rejects. Reportedly improves latency by 40–50% on typical completions, though no public benchmark is cited for this figure.
  • vLLM speculative decoding mode: The vLLM inference engine implements speculative decoding with configurable draft and verifier models, allowing practitioners to pair models of different sizes (e.g., a 7B draft with a 70B verifier) to reduce end-to-end latency on batch and streaming workloads.
