InstructGPT
InstructGPT is a family of language models developed by OpenAI, introduced in the 2022 paper "Training language models to follow instructions with human feedback," that demonstrated Reinforcement Learning from Human Feedback (RLHF) as a practical method for aligning language models with human intent. The headline result: human evaluators preferred outputs from the 1.3B-parameter InstructGPT over those from the 175B-parameter GPT-3, despite the former having over 100x fewer parameters — strong evidence that alignment, not just scale, determines a model's usefulness.
Significance
- Established the RLHF pipeline. The three-step process (supervised fine-tuning → reward modeling → PPO optimization) became the standard template for training commercial LLMs. ChatGPT, Claude, and Gemini all use variations of this pipeline.
- Small aligned > large unaligned. The finding that a small aligned model beats a massive unaligned one made deploying helpful AI assistants economically viable.
- Identified reward hacking. Early documentation of the risk that models optimize for high reward model scores without actually producing better outputs — a concern that remains central to alignment research.
- Precursor to ChatGPT. InstructGPT’s instruction-following capabilities, refined through further RLHF iterations, became ChatGPT — the product that brought LLMs to mainstream awareness.
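The reward-modeling step of the pipeline above trains a model on human preference comparisons: given two candidate responses, it should score the human-preferred one higher. A minimal sketch of the pairwise (Bradley–Terry-style) loss that step minimizes — the function name and toy scores are illustrative, not OpenAI's implementation:

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise comparison loss for reward modeling (step 2):
    -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when the ranking
    is inverted -- so minimizing it teaches the model to reproduce
    human preference orderings.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scores: correct ranking gives a small loss, inverted ranking a large one.
print(reward_model_loss(2.0, 0.0))  # small: preferred response scored higher
print(reward_model_loss(0.0, 2.0))  # large: preferred response scored lower
```

The trained reward model's scalar score then serves as the reward signal that PPO maximizes in step 3, which is also where the reward-hacking risk noted above arises: the policy can learn to exploit quirks of the reward model rather than genuinely improve.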