Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is the dominant technique for aligning large language models with human preferences. Introduced at scale by OpenAI in the InstructGPT paper (2022), RLHF demonstrated that a small model fine-tuned with human feedback was preferred by evaluators over a much larger unaligned model, making alignment, not just scale, the path to useful AI systems.

The Problem RLHF Solves

Language models pretrained on internet text learn to predict the next token — not to be helpful, truthful, or safe. The pretraining objective is fundamentally different from “follow instructions and be useful.” A model that perfectly predicts internet text will happily generate toxic content, fabricate information, or ignore the user’s actual request, because the internet contains all of those patterns. RLHF bridges this gap by training the model to produce outputs that humans actually prefer.

The Three-Step Pipeline

The standard RLHF pipeline, established by the InstructGPT paper:

Step 1: Supervised Fine-Tuning (SFT)

Collect demonstrations of desired behavior from human annotators. Given a prompt, a human writes the ideal response. Fine-tune the pretrained model on this dataset using standard supervised learning. This gives the model a rough sense of the target behavior.
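In a minimal sketch (assuming per-token log-probabilities from the model are already available), the SFT objective is just the average negative log-likelihood of the human-written response, with prompt tokens masked out of the loss. The mask layout here is illustrative:

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | context) at every sequence position
    loss_mask: 1 for response tokens, 0 for prompt tokens (hypothetical layout)
    """
    total = -sum(lp * m for lp, m in zip(token_logprobs, loss_mask))
    return total / max(sum(loss_mask), 1)

# Two masked prompt tokens followed by a 3-token response:
loss = sft_loss([-0.1, -0.2, -1.2, -0.9, -0.3], [0, 0, 1, 1, 1])  # 0.8
```

Masking the prompt ensures the model is trained to produce the response, not to re-predict the instruction it was given.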

Step 2: Reward Modeling

Collect comparison data: given a prompt, generate multiple model outputs and have humans rank them from best to worst. Train a reward model (a separate neural network) to predict which outputs humans prefer. This converts human judgment into a differentiable signal that can be optimized.
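Human rankings decompose into pairwise comparisons, and the reward model is typically trained with a Bradley-Terry-style pairwise loss on its scalar scores. A minimal sketch, assuming the two scores have already been computed:

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry objective: maximize sigmoid(r_chosen - r_rejected),
    # i.e. the modeled probability that the human-preferred output scores higher
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A well-separated pair incurs little loss; a reversed pair is penalized:
assert pairwise_loss(2.0, -1.0) < pairwise_loss(-1.0, 2.0)
```

Minimizing this loss pushes the reward model to assign higher scores to outputs humans preferred, turning discrete judgments into a smooth, optimizable signal.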

Step 3: RL Optimization

Use the reward model as a training signal. Generate outputs from the SFT model, score them with the reward model, and update the model's weights to produce higher-scoring outputs. The standard algorithm is Proximal Policy Optimization (PPO), typically combined with a KL-divergence penalty that keeps the model close to the SFT baseline and prevents it from "gaming" the reward model.
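The two mechanisms above can be sketched in isolation. The KL coefficient and clip range below are illustrative defaults, not the values any particular system used:

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    # The per-token log-prob gap estimates the KL divergence from the SFT
    # baseline; subtracting it discourages the policy from drifting far away
    return rm_score - beta * (logp_policy - logp_sft)

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); clipping caps the incentive to
    # move the policy far in a single update step
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, raising the ratio past 1 + eps yields no further gain, so each update stays conservative even when the reward model strongly favors a sample.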

Key Results

The InstructGPT findings that established RLHF:

  • Alignment beats scale. The 1.3B InstructGPT model was preferred over the 175B GPT-3 by human evaluators, despite having over 100x fewer parameters.
  • Low alignment tax. RLHF did not significantly degrade performance on standard NLP benchmarks, suggesting alignment and capability are not fundamentally at odds.
  • Improved safety. InstructGPT showed reductions in toxic output and improvements in truthfulness.

Beyond Alignment: RL for Capabilities

DeepSeek-R1 extended the use of RL beyond alignment into capability development. By training with Group Relative Policy Optimization (GRPO) using only rule-based rewards (accuracy + format, no human labels), the model developed emergent reasoning behaviors — self-reflection, verification, backtracking — that were never explicitly taught. This showed that RL can unlock capabilities that supervised learning alone cannot reach.
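GRPO's central trick is replacing PPO's learned value function with a group baseline: sample several completions per prompt and standardize each reward against the others in the group. A minimal sketch of that advantage computation:

```python
def group_relative_advantages(rewards, eps=1e-8):
    # GRPO baseline: each sampled completion's advantage is its reward
    # standardized against the other samples for the same prompt
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math problem, scored 1 if correct else 0:
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
```

Because the baseline comes from the group itself, no separate value network is needed, which cuts memory and training cost substantially.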

The key difference from standard RLHF: DeepSeek-R1 deliberately avoided neural reward models (which are susceptible to reward hacking) and instead used verifiable rewards — math problems have correct answers, code has test suites. This limits the approach to domains with objective evaluation but eliminates the need for human labeling.
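A verifiable reward can be as simple as an exact-match check plus a format bonus. The answer-tag convention and weights below are illustrative, not DeepSeek-R1's exact rules:

```python
def verifiable_reward(completion, reference_answer):
    # Accuracy reward via exact match on a tagged answer span; the
    # <answer> tag convention and 0.1 format bonus are assumptions here
    if "<answer>" in completion and "</answer>" in completion:
        answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
        format_bonus = 0.1
    else:
        answer, format_bonus = "", 0.0
    accuracy = 1.0 if answer == reference_answer.strip() else 0.0
    return accuracy + format_bonus

score = verifiable_reward("Reasoning... <answer>42</answer>", "42")  # 1.1
```

Because the reward is computed by a fixed rule rather than a learned network, there is no proxy model for the policy to exploit, only the task itself.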

Challenges

  • Reward hacking. Models can learn to produce outputs that score high on the reward model without actually being better. Optimizing too aggressively against an imperfect proxy leads to degenerate behavior.
  • Labeler disagreement. Human preferences are subjective and inconsistent. Different labelers may rank the same outputs differently, introducing noise into the reward model.
  • Distribution shift. The reward model is trained on a specific distribution of prompts and outputs. As the policy model shifts during RL, it may encounter regions where the reward model’s predictions are unreliable.
  • Cost and speed. RLHF is expensive: it requires human annotators, reward model training, and iterative RL optimization. Constitutional AI reduces the human labeling requirement by substituting AI feedback, while direct preference optimization (DPO) removes the separate reward model and RL loop by optimizing the policy directly on preference pairs.
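DPO, mentioned in the last bullet, folds the reward model into the policy itself: each response's implicit reward is its log-probability ratio against a frozen reference model, and the loss is a logistic loss on the chosen-vs-rejected margin. A minimal sketch, assuming sequence log-probabilities are already summed:

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob);
    # the loss pushes the chosen response's implicit reward above the rejected one's
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the gradient flows directly into the policy from preference pairs, DPO needs no reward model training run and no PPO loop, at the cost of losing the on-policy exploration that RL provides.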
