Self-consistency (prompting)

From llmref.wiki
Self-consistency (prompting) — Sampling multiple chain-of-thought reasoning paths and aggregating outputs via majority voting to improve answer reliability.

Overview

Self-consistency is a prompt engineering technique that mitigates the variability inherent in sampling-based language model inference. Rather than accepting a single generated reasoning path and final answer, self-consistency generates multiple independent chain-of-thought trajectories from the same input prompt, then aggregates the final answers using majority voting or other ensemble methods.[1]

The method operates under the principle that diverse reasoning paths may converge on a correct answer despite variations in intermediate steps. By sampling multiple solutions and selecting the most frequently occurring final answer, self-consistency reduces the impact of individual reasoning errors or model hallucinations that might affect a single inference pass. This approach is particularly effective on tasks requiring multi-step reasoning, such as arithmetic, commonsense reasoning, and symbolic manipulation.
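
The intuition that voting suppresses individual errors can be illustrated with a toy calculation. The sketch below assumes a deliberately simplified model that is not part of the original method description: answers are binary (right or wrong), every sample is correct with the same probability p, and samples are independent. Real sampled trajectories are correlated, so the numbers are only indicative. The function name `majority_correct_probability` is hypothetical.

```python
from math import comb


def majority_correct_probability(p: float, k: int) -> float:
    """Probability that a strict majority of k samples is correct.

    Toy assumptions: each sample is correct with probability p,
    independently of the others, and k is odd so ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("use an odd k to avoid ties in this toy model")
    needed = k // 2 + 1  # smallest strict majority
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(needed, k + 1)
    )


if __name__ == "__main__":
    for k in (1, 5, 15, 41):
        print(k, round(majority_correct_probability(0.6, k), 3))
```

Under these assumptions, voting helps only when p exceeds 0.5 and hurts when it is below 0.5. With many possible answers the picture is more forgiving: wrong answers tend to scatter across different values, so the correct answer can win the vote even when fewer than half of the samples reach it.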

Self-consistency does not require retraining or fine-tuning the underlying model. It relies on the stochastic decoding that language models already support, such as temperature, top-k, or nucleus sampling, which makes it a straightforward extension of chain-of-thought prompting. The technique trades inference cost for reliability: generating k samples costs roughly k times as much as a single pass, in exchange for more accurate final answers on reasoning tasks.

How it works

Self-consistency operates in three stages:

Sampling stage: Given a prompt and chain-of-thought instruction, the model generates k independent reasoning trajectories. Each trajectory is sampled with non-zero temperature (typically 0.5–1.0) to introduce diversity while maintaining coherence. The number of samples k is a hyperparameter; common values range from 5 to 40 depending on task complexity and computational budget.

Aggregation stage: For each of the k sampled trajectories, the final answer is extracted, typically as the last numerical value, category label, or explicit conclusion statement. The reasoning text itself is set aside at this point; only the extracted answers are collected into a multiset and counted.

Voting stage: The final output is determined by majority vote: the answer appearing most frequently across all k samples is returned. In practice this is a plurality vote, since the winning answer need not appear in more than half of the samples. In case of ties, tie-breaking strategies may include selecting the first or highest-confidence answer, or reverting to a baseline single-pass response.
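
The three stages above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample_fn` stands in for whatever model call is used (it is a hypothetical callable that returns one sampled reasoning trajectory as text), `fake_sampler` is a stub so the example runs without a model, and the "last number in the text" extraction rule is just one of the heuristics mentioned above.

```python
import random
import re
from collections import Counter
from typing import Callable, Optional


def extract_final_answer(trajectory: str) -> Optional[str]:
    """Return the last number that appears in the text, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", trajectory)
    return numbers[-1] if numbers else None


def self_consistency(
    prompt: str,
    sample_fn: Callable[[str], str],  # hypothetical model call
    k: int = 10,
) -> Optional[str]:
    # Sampling stage: draw k trajectories from the same prompt.
    trajectories = [sample_fn(prompt) for _ in range(k)]

    # Aggregation stage: keep only the extracted final answers.
    answers = [
        a for a in map(extract_final_answer, trajectories) if a is not None
    ]
    if not answers:
        return None  # nothing parseable; caller may fall back to one pass

    # Voting stage: most frequent answer wins. Counter.most_common lists
    # equal counts in first-seen order, so ties go to the earliest answer.
    return Counter(answers).most_common(1)[0][0]


def fake_sampler(prompt: str) -> str:
    """Stub standing in for a stochastic model call (assumption)."""
    return random.choice([
        "3 boxes of 6 eggs is 3 * 6 = 18. The answer is 18",
        "6 + 6 + 6 = 18. The answer is 18",
        "3 * 6 = 16. The answer is 16",  # deliberate arithmetic slip
    ])


if __name__ == "__main__":
    random.seed(0)
    print(self_consistency("How many eggs are in 3 boxes of 6?", fake_sampler))
```

Because the samples are independent of one another, the list comprehension in the sampling stage could equally be replaced by concurrent or batched requests.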

The method is agnostic to the specific chain-of-thought formatting used. It works with natural language reasoning, step-by-step calculations, and structured reasoning schemas. The quality of self-consistency output depends on both the diversity of sampled paths (controlled by temperature and sampling method) and the underlying model's baseline capability on the task.
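
To make the role of temperature concrete, the following sketch applies the standard temperature-scaled softmax to a made-up set of three token logits and reports the entropy of the resulting distribution. The logit values are illustrative assumptions, and real decoders often combine temperature with top-k or nucleus truncation, which is not shown here.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # illustrative values, not from a real model
    for t in (0.2, 0.7, 1.0):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top token, so repeated samples look alike and the vote adds little; higher temperatures spread probability out, which increases diversity at some cost to coherence.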

Distinction from related terms

| Term | Distinction |
| --- | --- |
| Chain-of-thought (CoT) | CoT is a prompt format that instructs the model to show reasoning steps before answering. Self-consistency *uses* CoT as input but adds ensemble sampling and voting; CoT alone generates a single trajectory. |
| Prompt chaining | Prompt chaining sequences multiple model calls where output from one step feeds into the next. Self-consistency runs parallel, independent sampling of the same prompt and aggregates via voting rather than sequential composition. |
| LLM-as-judge | LLM-as-judge uses a separate model instance to evaluate or select among multiple candidate answers. Self-consistency uses deterministic majority voting over samples from the same model without additional evaluation. |
| Automated evaluation | Automated evaluation measures the quality of generated outputs against reference standards. Self-consistency is a generation technique that *improves* output quality; it is not itself an evaluation method, though its outputs may be evaluated. |
| Multi-agent orchestration | Multi-agent orchestration coordinates multiple distinct agents or model instances with different roles. Self-consistency uses a single model and generates replicate samples of the same task; no role differentiation or inter-agent communication occurs. |

Examples

Arithmetic reasoning: On the GSM8K (Grade School Math) dataset, a model prompted with few-shot chain-of-thought exemplars generates 40 independent reasoning chains for a word problem. The chains take different intermediate calculation routes, some correct and some containing arithmetic errors. Majority voting over the 40 final numerical answers selects the most common result, improving accuracy from ~60% (single-pass CoT) to ~75–80% (self-consistency with 40 samples) for the larger models reported.[1]
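
For numerical answers, the vote is only meaningful if equivalent surface forms are counted together. The sketch below shows one possible normalization step before voting; the rules it applies (stripping thousands separators, a leading dollar sign, and a trailing period) are illustrative assumptions rather than anything prescribed by the method, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def normalize_number(raw: str) -> Optional[str]:
    """Map surface forms such as '18', '18.0', '$18', '18.' to one key."""
    cleaned = raw.strip().replace(",", "").lstrip("$").rstrip(".")
    try:
        value = float(cleaned)
    except ValueError:
        return None  # not a number; excluded from the vote
    return str(int(value)) if value.is_integer() else repr(value)


def vote(raw_answers: Iterable[str]) -> Tuple[str, float]:
    """Return the winning normalized answer and its share of valid votes."""
    normalized = [n for n in map(normalize_number, raw_answers) if n is not None]
    if not normalized:
        raise ValueError("no parseable answers to vote on")
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


if __name__ == "__main__":
    raw = ["18", "18.0", "$18", "16", "18."]
    print(Counter(raw).most_common(1))  # raw strings: every form is distinct
    print(vote(raw))                    # normalized: four of five agree on 18
```

Without normalization, the five raw strings in the usage example are all different and the vote is a five-way tie; after normalization, four of them collapse to the same answer.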

Commonsense reasoning: On commonsense QA datasets such as CommonsenseQA and StrategyQA, where the task is to select the most plausible answer or explanation for an event, self-consistency with as few as 5 samples can increase accuracy. The sampled explanations follow different reasoning paths, each locally coherent, and the vote favors the answer on which most of them converge.

Symbolic manipulation: For tasks requiring precise logical reasoning or equation solving, self-consistency with temperature 0.5 generates multiple symbolic derivations. When 7 out of 10 samples reach the same final symbolic form (e.g., a simplified equation), that form is selected, reducing the likelihood of algebraic errors present in individual inference passes.
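
The 7-out-of-10 case can be sketched as a vote that also reports how strongly the samples agree. Two things in this sketch are assumptions added for illustration: the `min_agreement` threshold, below which the function abstains so the caller can fall back to another strategy, and the `canonical` helper, which only strips whitespace and case. Deciding true symbolic equivalence would need a computer algebra system, which is not shown.

```python
from collections import Counter
from typing import List, Optional, Tuple


def canonical(expr: str) -> str:
    """Crude canonical form: remove whitespace, lowercase.

    This does not detect algebraic equivalence such as 'x=3' vs '3=x'.
    """
    return "".join(expr.split()).lower()


def vote_with_threshold(
    forms: List[str],
    min_agreement: float = 0.5,  # hypothetical cut-off, tune per task
) -> Tuple[Optional[str], float]:
    """Return (winner or None, agreement share of the top answer)."""
    if not forms:
        raise ValueError("no samples to vote on")
    counts = Counter(canonical(f) for f in forms)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(forms)
    return (winner if agreement >= min_agreement else None), agreement


if __name__ == "__main__":
    samples = [
        "x = 3", "x=3", "x = 3", "x = -3", "x=3",
        "x = 3", "x = 3/2", "x=3", "x = 3", "x = -3",
    ]
    print(vote_with_threshold(samples))  # seven of the ten forms are x=3
```

The agreement share is a rough confidence signal: a narrow plurality suggests the samples disagree substantially, even though a winner still exists.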

References

  1. Wang, Xuezhi et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv preprint arXiv:2203.11171 (2023).