RLHF
Overview
Reinforcement Learning from Human Feedback (RLHF) is a training methodology that combines language model fine-tuning with reinforcement learning to align model behavior with human preferences. Rather than relying solely on supervised learning from human-annotated datasets, RLHF incorporates human judgments about output quality iteratively during training, allowing models to optimize for characteristics that may be difficult to specify formally (e.g., helpfulness, harmlessness, or factual accuracy).
The RLHF pipeline typically consists of three stages: supervised fine-tuning of a pre-trained model, training of a reward model on human preference comparisons, and reinforcement learning against that reward model. The reward model, itself a learned component, guides the fine-tuned model through a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) to maximize expected reward while remaining close to the original model distribution.
RLHF has become a standard technique in post-training of large language models, enabling practitioners to adapt model behavior toward specific domains or organizational values without retraining the model from scratch. The approach assumes that human annotators can reliably rank or score model outputs, and that this signal generalizes to prompts and outputs that were not part of the annotated data.
How it works
RLHF operates in three primary phases:
Phase 1: Supervised Fine-Tuning (SFT). A pre-trained foundation model is fine-tuned on a curated dataset of high-quality input-output pairs, typically written by human experts. This step establishes baseline behavior that follows the demonstrated examples but has not yet been optimized against human preference comparisons.
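The supervised objective in this phase is usually a next-token negative log-likelihood on the expert-written responses. The following is a minimal sketch, not a training implementation: the function name `sft_loss`, the choice to mask out prompt tokens, and the example probabilities are all assumptions made for illustration.

```python
import math


def sft_loss(token_log_probs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigns to each reference token.
    response_mask:   1 for tokens in the expert-written response, 0 for prompt tokens.
    Masking the prompt is an assumption of this sketch, not a requirement of SFT.
    """
    if len(token_log_probs) != len(response_mask):
        raise ValueError("inputs must have the same length")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("mask selects no response tokens")
    total = -sum(lp * m for lp, m in zip(token_log_probs, response_mask))
    return total / n_response


# Hypothetical values: two prompt tokens followed by three response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(log_probs, mask)
```

Lower loss means the model assigns higher probability to the demonstrated response; in practice the log-probabilities would come from the model's forward pass rather than a hand-written list.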
Phase 2: Reward Model Training. Human annotators compare two or more model outputs for the same prompt and rank them by preference (e.g., "Response A is better than Response B"). These preference judgments are used to train a separate reward model, usually a fine-tuned language model with a scalar output head, to predict which outputs humans prefer. The reward model learns to assign higher scores to outputs that annotators judged to be better.
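One common way to turn pairwise comparisons into a training signal is a Bradley-Terry style loss, which penalizes the reward model when the preferred output does not score higher than the rejected one. The sketch below assumes that formulation; the function names are hypothetical, and the scalar scores stand in for the output of the reward model's scalar head.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), arranged to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_a, score_b):
    """Modeled probability that output A is preferred over output B."""
    # sigmoid(margin) equals exp(-loss) for the loss defined above.
    return math.exp(-pairwise_preference_loss(score_a, score_b))


# Hypothetical scores for one annotated comparison.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)  # preferred output scored higher
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)    # preferred output scored lower
```

The loss is small when the reward model already ranks the pair the way the annotator did and grows roughly linearly with the margin when it ranks them the wrong way, which is what pushes scores for preferred outputs upward during training.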
Phase 3: Reinforcement Learning Optimization. The SFT model, now treated as the policy, is optimized using a reinforcement learning algorithm (commonly Proximal Policy Optimization) with the reward model supplying the reward signal. During this phase, the policy generates outputs in response to prompts, receives reward scores from the reward model, and has its parameters updated to maximize expected reward while staying close to the SFT model's output distribution (enforced via a Kullback-Leibler divergence penalty). The penalty limits how far the policy can drift from learned language patterns and makes it harder to exploit weaknesses in the reward model.
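The Kullback-Leibler penalty is often folded directly into the reward that the RL algorithm sees. The sketch below shows one such sequence-level form, assuming the penalty is estimated from the log-probability ratio of the sampled tokens under the policy and the frozen SFT (reference) model. The function name and the coefficient `beta=0.1` are illustrative assumptions, and many PPO implementations apply the penalty per token instead.

```python
def kl_penalized_reward(reward_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus a scaled estimate of KL(policy || reference).

    policy_log_probs:    log-probs of the sampled tokens under the current policy.
    reference_log_probs: log-probs of the same tokens under the frozen SFT model.
    beta:                penalty strength (hypothetical default for this sketch).
    """
    if len(policy_log_probs) != len(reference_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio over the sampled tokens: a single-sample KL estimate.
    log_ratio = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_score - beta * log_ratio


# Hypothetical two-token output that the policy finds more likely than the SFT model does.
shaped = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)

# When policy and reference agree, the reward passes through unchanged.
unshaped = kl_penalized_reward(1.0, [-0.7, -0.2], [-0.7, -0.2], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller one allows more drift and more room for reward model exploitation.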
The entire process may iterate: collecting additional human feedback on outputs from the updated policy, retraining the reward model, and further optimizing the policy.
| Term | Distinction |
|---|---|
| Prompt engineering | Prompt engineering modifies input instructions without changing model parameters; RLHF permanently updates model weights using human preference feedback. |
| Supervised fine-tuning | Supervised fine-tuning uses labeled input-output pairs to teach specific behaviors; RLHF uses comparative human judgments and learned reward functions to optimize preferences that may lack discrete labels. |
| In-context learning | In-context learning provides examples within a prompt to guide behavior at inference time; RLHF modifies underlying model weights to internalize preferences across all prompts. |
| Chain-of-thought | Chain-of-thought is a prompting technique to elicit intermediate reasoning; RLHF is a training methodology that can reinforce behaviors like reasoning when humans prefer them. |
| Instruction tuning | Instruction tuning teaches a model to follow diverse instructions using supervised pairs; RLHF uses human-ranked comparisons and reward models to optimize for nuanced preference signals beyond literal instruction-following. |
Examples
InstructGPT / GPT-3.5 Training [1]. OpenAI's InstructGPT applied RLHF to fine-tune GPT-3, collecting human preference judgments from annotators who compared model outputs on diverse tasks. A reward model trained on these comparisons guided PPO optimization, producing a model more aligned with user intent and less inclined to generate harmful content. In the reported evaluations, labelers preferred outputs from a 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3.
Constitutional AI and Anthropic's Claude [2]. Anthropic extended RLHF with Constitutional AI (CAI), which uses a set of behavioral principles (a "constitution") to evaluate model outputs. Rather than relying exclusively on human annotations, a language model is prompted to compare outputs against the constitutional principles, producing AI-generated preference labels (an approach known as reinforcement learning from AI feedback, or RLAIF). This reduces the amount of human labeling needed, particularly for harmful content. The hybrid approach was applied to models in the Claude family to improve helpfulness and reduce harmful outputs.
Meta's Llama 2 Post-Training. Llama 2's post-training pipeline incorporated RLHF with reward models trained on human preference data. Meta published documentation of the training procedure, including the annotation process and the composition of the preference data used for reward modeling, providing transparency into the alignment methodology used for a widely deployed, openly released model.
See also
- Fine-tuning — the broader family of weight-update techniques RLHF builds upon
- Prompt engineering — an alternative alignment method operating at inference time
- LLM-as-judge — using language models as evaluators, related to reward model construction
- Foundation model — pre-trained base models that RLHF post-processes
- System prompt — a complementary mechanism for steering model behavior without retraining