Source: Training Language Models to Follow Instructions with Human Feedback

Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida
Published: March 3, 2022
Source: arxiv.org

Summary

The InstructGPT paper from OpenAI introduced Reinforcement Learning from Human Feedback (RLHF) as a practical method for aligning language models with human intent. The core finding: a 1.3B parameter model fine-tuned with human feedback was preferred over the 175B GPT-3 by human evaluators, despite having 100x fewer parameters. Making models bigger doesn’t make them more helpful — alignment does.

Key Claims

  • Language models are misaligned by default. The pretraining objective (predict the next token from internet text) is fundamentally different from “follow the user’s instructions helpfully and safely.” Models trained only on next-token prediction produce outputs that are untruthful, toxic, or simply unhelpful.
  • RLHF aligns models with human preferences. The three-step process: (1) collect demonstrations of desired behavior and fine-tune with supervised learning, (2) collect human rankings of model outputs and train a reward model, (3) optimize the policy against the reward model using PPO (Proximal Policy Optimization).
  • Small aligned models beat large unaligned ones. The 1.3B InstructGPT was preferred over the 175B GPT-3 on the OpenAI API prompt distribution. Alignment is a more efficient path to usefulness than raw scale.
  • Truthfulness and safety improve. InstructGPT showed improvements in truthfulness (TruthfulQA) and reductions in toxic output generation, with minimal performance regression on standard NLP benchmarks.
  • The “alignment tax” is low. Fine-tuning for instruction following did not significantly degrade the model’s capabilities on academic benchmarks, suggesting alignment and capability are not fundamentally at odds.
  • Neural reward models are susceptible to reward hacking. The authors noted that optimizing too aggressively against the reward model can lead to outputs that score high on the reward model but are not actually preferred by humans — an early observation of what became a major research concern.
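The reward-modeling step in the pipeline above (step 2) trains on pairwise human rankings: the reward model should score the completion the labeler preferred higher than the rejected one. A minimal sketch of that pairwise ranking loss, with scalar scores standing in for reward-model outputs (the function name and toy values are illustrative, not the paper's code):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise ranking loss used to train a reward model from human
    comparisons: loss = -log(sigmoid(r_chosen - r_rejected)).
    It is small when the preferred completion already scores higher,
    and large when the reward model disagrees with the labeler."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the labeler: small loss.
agree = reward_model_loss(2.0, -1.0)
# Reward model disagrees: large loss, pushing scores to flip.
disagree = reward_model_loss(-1.0, 2.0)
print(agree < disagree)  # True
```

In the full setup the scores come from a language model with a scalar head, and the loss is averaged over all ranked pairs drawn from each labeled prompt.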

Relevance and Implications

This paper made RLHF the standard technique for training commercial LLMs. Every major model deployed since — ChatGPT, Claude, Gemini — uses some form of human feedback alignment. The InstructGPT pipeline established the template: pretrain → supervised fine-tuning → reward modeling → RL optimization. The finding that alignment dramatically improves usefulness without requiring larger models was practically important — it meant deploying helpful AI assistants was economically viable. The paper also introduced key concerns (reward hacking, labeler disagreement) that remain active research areas. DeepSeek-R1 later showed that RL can go further, training reasoning capabilities without human demonstrations at all.
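The RL-optimization step in this pipeline guards against the reward hacking noted above by penalizing divergence from the supervised fine-tuned (SFT) policy: the PPO reward is the reward-model score minus a per-token KL term. A hedged sketch of that shaped reward, with scalar log-probabilities standing in for per-token model outputs (names and values are illustrative):

```python
def kl_shaped_reward(rm_score, logprob_policy, logprob_sft, beta=0.02):
    """Reward signal for the PPO step: the learned reward-model score
    minus a KL penalty keeping the policy near the SFT model.
    The penalty makes drifting into high-scoring but degenerate
    outputs costly, limiting over-optimization of the reward model."""
    kl_term = logprob_policy - logprob_sft  # per-token log-ratio estimate
    return rm_score - beta * kl_term

# Policy matches the SFT model: no penalty, reward is the RM score.
print(kl_shaped_reward(1.5, -4.0, -4.0))  # 1.5
# Policy has drifted far from the SFT model: reward is reduced.
print(kl_shaped_reward(1.5, -1.0, -8.0))
```

The coefficient `beta` trades off reward maximization against staying close to the SFT distribution; too small and the policy exploits the reward model, too large and it barely improves on SFT.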
