DeepSeek-R1
DeepSeek-R1 is a reasoning-focused language model from DeepSeek-AI, published in January 2025. Its primary contribution was demonstrating that LLM reasoning capabilities can emerge from pure reinforcement learning, without human-written reasoning demonstrations. The model achieved 79.8% pass@1 on AIME 2024 (surpassing the average human competitor) and strong performance across math, coding, and STEM benchmarks.
Significance
- Reasoning from RL alone. DeepSeek-R1-Zero, trained with GRPO using only rule-based accuracy and format rewards, autonomously developed chain-of-thought reasoning, self-verification, and error correction — behaviors never explicitly taught.
- The “aha moment.” Partway through training, R1-Zero exhibited a sudden shift in its problem-solving strategy: it began allocating more thinking time, re-evaluating its initial approach, and flagging reflection with tokens such as “wait” — an observable transition to emergent self-correction.
- Rule-based rewards avoid reward hacking. By using verifiable correctness signals (math answers, code test suites) instead of neural reward models, DeepSeek-R1 avoided the reward hacking risks inherent in standard RLHF.
- Distillable reasoning. The reasoning capabilities were successfully distilled into smaller dense models (1.5B to 70B, based on Qwen and Llama), with the distilled models outperforming their instruction-tuned counterparts.
- Extended RL beyond alignment. Where InstructGPT used RL to align models with preferences, DeepSeek-R1 showed RL can develop fundamentally new capabilities — not just make models more polite, but make them better reasoners.
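The rule-based reward design above can be made concrete with a minimal sketch. The `<think>`/`<answer>` template matches the one described for R1-Zero, but the function names and exact-match check are illustrative assumptions — the paper's actual verifiers (e.g. math normalization, code test execution) are not public.

```python
import re

# Illustrative sketch of R1-Zero-style rule-based rewards (function names
# and scoring values are assumptions, not the paper's implementation).

def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion wraps its chain of thought in <think>
    tags followed by a final <answer> block, per the R1-Zero template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the extracted final answer matches the verifiable
    ground truth (exact match here; real pipelines normalize math/run tests)."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Accuracy and format signals are simply summed in this sketch.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```

Because both signals are deterministic checks rather than a learned scorer, there is no reward model for the policy to exploit — which is the point of the bullet above.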
Training Pipeline
- DeepSeek-R1-Zero — pure RL (GRPO) on DeepSeek-V3-Base with accuracy and format rewards, demonstrating RL-only reasoning
- Cold-start SFT — fine-tune the base model on a small set of curated long chain-of-thought examples to fix readability and language mixing
- Reasoning-oriented RL — rule-based accuracy and format rewards, plus a language-consistency reward
- Rejection-sampling SFT — fine-tune on reasoning data sampled from the RL checkpoint plus non-reasoning data for well-rounded capability
- Final RL — rule-based rewards for reasoning combined with learned preference rewards for helpfulness and harmlessness
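The GRPO algorithm used in the RL stages replaces a learned value/critic model with group-relative advantages: sample several completions per prompt, score each with the reward, and normalize within the group. A minimal sketch of that normalization step (the surrounding clipped policy-gradient update is omitted):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each completion's reward against
    the group of completions sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a sketch-level choice
    if std == 0.0:
        # All completions scored identically: no relative signal to learn from.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

With binary accuracy rewards, correct completions in a mixed group get positive advantage and incorrect ones negative, so the policy is pushed toward answers that verify — without training a separate critic.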