Source: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Summary
DeepSeek-R1 demonstrated that LLM reasoning capabilities can be developed through pure reinforcement learning, without relying on supervised fine-tuning on human-written reasoning traces. Using Group Relative Policy Optimization (GRPO) with only rule-based rewards (correctness + format), the model autonomously developed sophisticated reasoning behaviors — self-reflection, verification, backtracking, and strategy adaptation — that emerged purely from the incentive to produce correct answers.
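The core mechanic of GRPO is that it needs no learned value network: for each prompt it samples a group of completions, scores them with the rule-based reward, and normalizes each reward against the group's mean and standard deviation to get a relative advantage. A minimal sketch of that normalization step (the function name and zero-variance guard are my own, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled completion
    against the mean/std of its own group, so no critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: 4 completions sampled for one prompt; two answered correctly.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # -> [1.0, -1.0, 1.0, -1.0]
```

Completions that beat the group average get positive advantage and are reinforced; the rest are pushed down, which is the entire incentive driving the emergent reasoning behaviors described below.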
Key Claims
- RL alone produces reasoning. DeepSeek-R1-Zero, trained with RL on DeepSeek-V3-Base without any supervised fine-tuning, developed chain-of-thought reasoning, self-verification, and error correction. These behaviors were not taught — they emerged from the reward signal.
- The model exhibits an “aha moment.” During training, the model showed a sudden shift in reasoning patterns: a marked increase in the use of the word “wait” during reflection, signaling the emergence of self-correction. The authors describe this as witnessing the model learn to rethink its approach mid-solution.
- Thinking time increases naturally. Response length grew steadily during RL training as the model learned that more reasoning steps improve accuracy. The model autonomously discovered that harder problems benefit from longer deliberation.
- Rule-based rewards avoid reward hacking. The authors deliberately avoided neural reward models, using only accuracy rewards (is the answer correct?) and format rewards (is the reasoning wrapped in <think> tags?). This sidesteps the reward-hacking problems observed in RLHF with learned reward models.
- AIME performance jumps from 15.6% to 86.7%. On the AIME 2024 math competition, pass@1 accuracy rose from 15.6% to 77.9% through RL training alone. With self-consistency decoding (majority voting over multiple samples), performance reached 86.7%, surpassing the average human competitor.
- Full DeepSeek-R1 adds supervised fine-tuning. To address readability and language mixing issues in R1-Zero, the full R1 pipeline adds cold-start SFT data, followed by RL with both rule-based and preference-based rewards. This produces a model that reasons well and communicates clearly.
- Distillation transfers reasoning to smaller models. The reasoning capabilities were distilled into smaller models (1.5B to 70B parameters), with the distilled models outperforming their instruction-tuned counterparts. This demonstrates that the reasoning patterns can be compressed.
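The rule-based reward described above can be sketched as a small function: a format check that the output wraps its reasoning in <think> tags, plus an exact-match accuracy check. The specific tag layout, answer-extraction regex, and reward weights here are illustrative assumptions, not the paper's exact implementation:

```python
import re

# Assumed output template: reasoning in <think>, final result in <answer>.
TEMPLATE_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward in the spirit of R1-Zero: a small format
    reward for following the template, a larger accuracy reward for
    a correct final answer. Weights (0.5 / 1.0) are illustrative."""
    m = TEMPLATE_RE.match(completion.strip())
    if m is None:
        return 0.0  # malformed output earns nothing
    format_reward = 0.5
    answer = m.group(1).strip()
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0
    return format_reward + accuracy_reward
```

Because both checks are deterministic string operations, there is no learned reward model for the policy to exploit, which is the point of the design.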
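The self-consistency decoding behind the 86.7% AIME figure amounts to sampling several independent solutions and keeping the most common final answer. A minimal sketch (sample count and answers are made-up examples):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency decoding: return the most frequent sampled
    answer; ties go to the earliest-seen answer (Counter preserves
    insertion order for equal counts)."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled final answers for one competition problem
best = majority_vote(["204", "197", "204", "204", "11"])  # -> "204"
```

Voting only helps when independent samples err in different ways but agree when correct, which is why it compounds well with the longer, more reliable reasoning traces RL training produces.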
Relevance and Implications
DeepSeek-R1 showed that the next frontier of LLM capability may come from RL-trained reasoning rather than simply scaling pretraining. The finding that complex reasoning behaviors emerge from simple reward signals — without explicit human demonstrations — suggests that RL can unlock capabilities that supervised learning cannot teach. This extends the RLHF paradigm beyond alignment into capability development: rather than just making models helpful and harmless, RL can make them fundamentally more capable reasoners. The distillation results suggest these capabilities can be democratized across model sizes.