Temperature (sampling)

From llmref.wiki
Temperature (sampling) — A decoding parameter that scales the probability distribution over next tokens, controlling the diversity and randomness of a model's output.

Overview

Temperature, in the context of language model decoding, is a scalar parameter T that rescales the model's logits over the vocabulary before the next token is sampled: each logit is divided by T before the softmax is applied. At temperature 1.0, the model samples from its learned distribution unmodified. Lower temperatures concentrate probability mass on the highest-probability tokens, making outputs more deterministic and often more repetitive. Higher temperatures flatten the distribution, increasing diversity and unpredictability.

Temperature is a property of the decoding strategy, not of the model itself: the same model, with the same weights, behaves differently at different temperatures.
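The scaling described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation; the function name `softmax_with_temperature` and the three-token logits are assumptions made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        # Temperature 0 is usually handled separately as greedy argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary; the model is unchanged,
# only the temperature differs between runs.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's probability shrinks as the temperature rises, while the ranking of tokens stays the same.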

Effect on output

Temperature | Effect | Typical use
0 (or ≈0) | Greedy / near-deterministic: always selects the highest-probability token | Factual Q&A, extraction, classification
0.1–0.5 | Low diversity; focused, consistent outputs | Code generation, structured output
0.7–1.0 | Balanced diversity; default range for most chat applications | Conversational assistants
>1.0 | High diversity; increased risk of incoherence | Creative generation experiments

Describing temperature as a "creativity" setting is a common oversimplification. Temperature does not give the model new capabilities or knowledge; it only changes how tokens are sampled from the probability distributions the model already produces. At high temperature a model can produce text that sounds creative but is semantically incoherent or factually wrong, because raising the temperature makes every low-probability token more likely to be sampled, the wrong ones included. Diversity and error rate rise together.
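A small simulation makes the trade-off concrete. It is a sketch under stated assumptions: `sample_token` is a hypothetical helper, the five logits are invented, and temperature 0 is treated as greedy argmax, which is a common convention rather than a universal one.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample a token index; temperature 0 is treated as greedy argmax."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp((x - m) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical logits: index 0 is the likely token, indices 3 and 4
# stand in for unlikely (possibly wrong) continuations.
logits = [4.0, 3.0, 1.0, 0.0, -1.0]
rng = random.Random(0)  # fixed seed so the run is repeatable
for t in (0, 0.3, 1.0, 2.0):
    draws = [sample_token(logits, t, rng) for _ in range(1000)]
    top_share = draws.count(0) / len(draws)
    print(f"T={t}: top-token share={top_share:.2f}, distinct tokens={len(set(draws))}")
```

As the temperature rises, the share of draws going to the top token falls and the unlikely indices start to appear, which is the same mechanism behind both the added variety and the added errors.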

Interaction with top-p and top-k

Temperature is typically used alongside:

  • Top-k sampling: truncate the distribution to the k highest-probability tokens before sampling.
  • Top-p (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds p.

These parameters interact. Temperature 0 and top-k=1 each reduce to greedy decoding on their own, so combining them is deterministic in principle. At any temperature above 0 with no top-k or top-p restriction, every token in the vocabulary has non-zero probability. In practice, many APIs expose temperature as the primary dial, with top-p and top-k as secondary controls.
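The interaction can be sketched as a single filtering function. This is an illustration only: the name `filtered_probs` is hypothetical, and the order used here (temperature first, then top-k, then top-p measured on the temperature-scaled probabilities) is an assumption, since libraries and APIs differ in ordering and in how they treat the boundary at p.

```python
import math

def filtered_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature, then optional top-k and top-p truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices ordered from most to least probable.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        keep = keep[:top_k]
    if top_p is not None:
        cumulative, nucleus = 0.0, []
        for i in keep:
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # assumption: stop once p is reached
                break
        keep = nucleus

    # Renormalize over the surviving tokens; the rest get probability 0.
    kept = set(keep)
    kept_total = sum(probs[i] for i in keep)
    return [probs[i] / kept_total if i in kept else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
print(filtered_probs(logits))                             # all tokens non-zero
print(filtered_probs(logits, temperature=1.5, top_k=1))   # one token survives
print(filtered_probs(logits, top_p=0.9))                  # tail token removed
```

With `top_k=1` the temperature no longer matters, because only the top-ranked token survives truncation and temperature never changes the ranking.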

Determinism caveat

Even at temperature 0, most production APIs are not fully deterministic. Floating-point addition is not associative, so the order in which values are accumulated, which can vary with hardware, parallelism, and batching, can shift logits slightly and flip a near-tie between two tokens. Applications requiring strict reproducibility should not rely on temperature 0 alone; they should log and store outputs.
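The floating-point effect is easy to reproduce in isolation. The sketch below is a toy: the three values and the `rival_logit` are invented to show how summation order alone can change a greedy choice, and it does not model any real inference stack.

```python
# The same three contributions to a logit, accumulated in two orders.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one accumulation order
right = a + (b + c)   # another order, e.g. from a different batch layout

print(left == right)  # False: floating-point addition is not associative
print(left, right)

# A hypothetical competing token whose logit sits exactly at the tie point.
rival_logit = 0.6
pick_left = 0 if left > rival_logit else 1
pick_right = 0 if right > rival_logit else 1
print(pick_left, pick_right)  # greedy decoding picks different tokens
```

Once a single token differs, every later token is conditioned on a different prefix, so the two outputs can diverge completely.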
