RLHF

RLHF — Technique of fine-tuning language models using human preference feedback to align outputs with human values and instructions.

Overview

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that combines language model fine-tuning with reinforcement learning to align model behavior with human preferences. Rather than relying solely on supervised learning from human-annotated datasets, RLHF incorporates human judgments about output quality iteratively during training, allowing models to optimize for characteristics that may be difficult to specify formally (e.g., helpfulness, harmlessness, or factual accuracy).

The RLHF pipeline typically consists of three stages: supervised fine-tuning of a pre-trained model, collection of human preference comparisons that are used to train a reward model, and reinforcement learning against that reward model. The reward model, itself a learned component, guides the policy through a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) to maximize expected reward while remaining close to the original model distribution.
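The trade-off between reward and closeness to the original model is commonly written as a KL-regularized objective. The following is a sketch in conventional notation, none of which appears in this article: \pi_\theta is the model being trained, \pi_{\mathrm{ref}} the reference (original) model, r_\phi the learned reward model, \mathcal{D} the prompt distribution, and \beta a penalty coefficient chosen by the practitioner.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \,
\mathbb{E}_{x \sim \mathcal{D}}
\Big[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
```

A larger \beta keeps the trained model closer to the reference model; a smaller \beta lets it pursue reward more aggressively.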

RLHF has become a standard technique in post-training of large language models, enabling practitioners to adapt model behavior toward specific domains or organizational values without retraining the model from scratch. The approach assumes that human annotators can rank or score model outputs reliably, and that the learned preference signal generalizes to prompts and outputs outside the annotated data.

How it works

RLHF operates in three primary phases:

Phase 1: Supervised Fine-Tuning (SFT)

A pre-trained foundation model is fine-tuned on a curated dataset of high-quality input-output pairs, typically generated by human experts. This initial step establishes a baseline behavior that is generally aligned but not fully optimized for human preferences.

Phase 2: Reward Model Training

Human annotators compare two or more model outputs for the same prompt and rank them by preference (e.g., "Response A is better than Response B"). These preference judgments are used to train a separate reward model, usually a fine-tuned language model with a scalar output head, to predict which outputs humans prefer. The reward model learns to assign higher scores to outputs that annotators judged to be better.
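Reward models of this kind are often trained with a pairwise loss derived from the Bradley-Terry preference model. The article does not specify a loss, so this is an assumed formulation. The sketch below takes the scalar scores as given (the example values are hypothetical) and computes only the loss; in practice the scores would come from the reward model and the loss would be minimized by gradient descent.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under a Bradley-Terry model,
    P(chosen > rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); both branches avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_loss(scored_pairs):
    """Average the loss over (score_chosen, score_rejected) pairs."""
    losses = [pairwise_preference_loss(c, r) for c, r in scored_pairs]
    return sum(losses) / len(losses)


# Hypothetical reward-model scores for three annotated comparisons.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(round(mean_loss(pairs), 4))
```

The loss is small when the preferred response already scores well above the rejected one, and grows roughly linearly when the ranking is inverted, which pushes the scores apart in the direction annotators chose.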

Phase 3: Reinforcement Learning Optimization

The SFT model, now treated as the policy, is optimized using a reinforcement learning algorithm (commonly Proximal Policy Optimization) with the reward model as the reward signal. During this phase, the policy generates outputs in response to prompts, receives reward scores from the reward model, and has its parameters updated to maximize expected reward while maintaining distributional similarity to the SFT model (enforced via a Kullback-Leibler divergence penalty). The penalty discourages the policy from drifting too far from learned language patterns and limits how far it can exploit weaknesses in the reward model.
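The combination of reward score and KL penalty can be illustrated for a single sampled response. This is a minimal sketch under stated assumptions: the function name, the `beta` default, and the sequence-level form of the penalty are illustrative choices rather than details from this article, and many implementations apply the penalty per token instead.

```python
def kl_shaped_reward(reward_model_score, logprobs_policy, logprobs_reference, beta=0.1):
    """Return the reward-model score minus a KL-style penalty.

    logprobs_policy / logprobs_reference: per-token log-probabilities of the
    sampled response under the current policy and the frozen SFT model.
    beta: penalty coefficient (hypothetical default, tuned in practice).
    """
    if len(logprobs_policy) != len(logprobs_reference):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) for this response:
    # the sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - q for p, q in zip(logprobs_policy, logprobs_reference))
    return reward_model_score - beta * kl_estimate


# Hypothetical numbers: the policy assigns the response higher probability
# than the SFT model does, so part of the reward is taken back as a penalty.
shaped = kl_shaped_reward(1.5, [-0.5, -1.0], [-0.7, -1.4], beta=0.1)
print(shaped)
```

The reinforcement learning algorithm then maximizes this shaped quantity rather than the raw reward-model score.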

The entire process may iterate: additional human feedback is collected on outputs from the improved policy, the reward model is retrained on the expanded preference data, and the policy is optimized further.

Distinction from related terms

Term	Distinction
Prompt engineering	Prompt engineering modifies input instructions without changing model parameters; RLHF permanently updates model weights using human preference feedback.
Supervised fine-tuning	Supervised fine-tuning uses labeled input-output pairs to teach specific behaviors; RLHF uses comparative human judgments and learned reward functions to optimize preferences that may lack discrete labels.
In-context learning	In-context learning provides examples within a prompt to guide behavior at inference time; RLHF modifies underlying model weights to internalize preferences across all prompts.
Chain-of-thought	Chain-of-thought is a prompting technique to elicit intermediate reasoning; RLHF is a training methodology that can reinforce behaviors like reasoning when humans prefer them.
Instruction-based tuning (general)	Instruction tuning teaches a model to follow diverse instructions using supervised pairs; RLHF uses human-ranked comparisons and reward models to optimize for nuanced preference signals beyond literal instruction-following.

Examples

InstructGPT / GPT-3.5 Training ^[1]

OpenAI's InstructGPT applied RLHF to fine-tune GPT-3, collecting preference judgments from annotators who compared model outputs on diverse tasks. A reward model trained on these comparisons guided PPO optimization, producing a model more aligned with user intent and less inclined to generate harmful content. In OpenAI's reported human evaluations, labelers preferred InstructGPT outputs over those of the GPT-3 base model.

Constitutional AI and Anthropic's Claude ^[2]

Anthropic extended RLHF with Constitutional AI (CAI), which uses a set of written behavioral principles (a "constitution") to evaluate model outputs. Rather than relying exclusively on human annotations, a language model is prompted to critique and revise its own responses and to choose between pairs of responses according to those principles. The resulting AI-generated preference labels are used to train the preference model, an approach known as reinforcement learning from AI feedback (RLAIF). This reduces annotation cost while retaining a preference signal. The hybrid approach was applied to models in the Claude family to improve helpfulness and reduce harmful outputs.
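The AI-feedback labeling step can be sketched as a function that turns a judge's verdict into a preference record of the kind used for reward model training. Everything here is hypothetical: the `judge` callable stands in for a prompted language model, `toy_judge` is a trivial rule used only to make the example runnable, and the record layout is an assumption rather than Anthropic's actual data format.

```python
from typing import Callable, Tuple

Comparison = Tuple[str, str, str]  # (prompt, chosen, rejected)


def label_with_ai_feedback(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str, str, str, str], str],
) -> Comparison:
    """Ask a judge which response better follows the principle.

    Returns a (prompt, chosen, rejected) record.
    """
    verdict = judge(prompt, response_a, response_b, principle)
    if verdict not in ("A", "B"):
        raise ValueError(f"judge must return 'A' or 'B', got {verdict!r}")
    if verdict == "A":
        return (prompt, response_a, response_b)
    return (prompt, response_b, response_a)


def toy_judge(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Stand-in for a language-model call: rejects response A if it contains an insult."""
    return "B" if "idiot" in response_a.lower() else "A"


record = label_with_ai_feedback(
    "How do I reset my password?",
    "Only an idiot forgets a password.",
    "Open Settings, choose Security, then select Reset password.",
    "Choose the response that is more polite and helpful.",
    toy_judge,
)
print(record[1])  # the response the judge preferred
```

Records produced this way can be mixed with human-labeled comparisons, which is what makes the approach a hybrid rather than a replacement for human feedback.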

Meta's Llama 2 Post-Training

Llama 2's post-training pipeline incorporated RLHF with reward models trained on human preference data, using separate reward models for helpfulness and safety. Meta published documentation of the training procedure, including annotation guidelines and the composition of the reward dataset, providing transparency into the alignment methodology used for a widely deployed, openly released model.

References

[1] Christiano, Paul et al. "Deep reinforcement learning from human preferences." Neural Information Processing Systems (NeurIPS), 2017.
[2] Bai, Yuntao et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, 2022.
