DeepSeek-R1

DeepSeek-R1 is a reasoning-focused language model from DeepSeek-AI, published in January 2025. Its primary contribution was demonstrating that LLM reasoning capabilities can emerge from pure reinforcement learning, without human-written reasoning demonstrations. The model scored 86.7% on AIME 2024 (above the average human competitor) and performed strongly across math, coding, and STEM benchmarks.

Significance

  • Reasoning from RL alone. DeepSeek-R1-Zero, trained with GRPO using only rule-based accuracy and format rewards, autonomously developed chain-of-thought reasoning, self-verification, and error correction — behaviors never explicitly taught.
  • The “aha moment.” During training, the model exhibited a sudden shift in reasoning patterns — pausing mid-solution with words like “wait” to re-examine its own steps — marking the emergence of self-correction. This was an observable transition in the model’s problem-solving strategy, not an explicitly trained behavior.
  • Rule-based rewards avoid reward hacking. By using verifiable correctness signals (math answers, code test suites) instead of neural reward models, DeepSeek-R1 avoided the reward hacking risks inherent in standard RLHF.
  • Distillable reasoning. The reasoning capabilities were successfully distilled into smaller models (1.5B to 70B), with distilled models outperforming their instruction-tuned counterparts.
  • Extended RL beyond alignment. Where InstructGPT used RL to align models with preferences, DeepSeek-R1 showed RL can develop fundamentally new capabilities — not just make models more polite, but make them better reasoners.
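The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is an illustrative reconstruction from the published description, not DeepSeek's code; the function names are ours. The key idea is that each sampled output is scored against the mean and standard deviation of its own group, replacing the learned value-function critic used in PPO:

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """GRPO-style advantage: for G sampled outputs to one prompt,
    A_i = (r_i - mean(r)) / std(r). The group baseline stands in
    for a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All outputs scored identically: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one prompt, scored 1.0 if correct
# by a rule-based checker, 0.0 otherwise.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct outputs receive positive advantage and incorrect ones negative, so the policy update pushes probability mass toward answers that verifiably pass the rule-based checks.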

Training Pipeline

  1. DeepSeek-R1-Zero — pure RL (GRPO) on DeepSeek-V3-Base with accuracy + format rewards
  2. Cold-start SFT — fine-tune on a small set of curated long chain-of-thought examples to fix readability and language mixing
  3. RL with mixed rewards — rule-based (accuracy, format) + learned (consistency, preference) rewards
  4. Final SFT — incorporate reasoning and non-reasoning data for well-rounded capability
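The rule-based rewards in steps 1 and 3 can be sketched as follows. This is a minimal illustration in the spirit of the accuracy + format rewards the paper describes; the tag names, weights, and exact-match check are assumptions, not the authors' implementation:

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Illustrative accuracy + format reward (weights are hypothetical).
    Format: reasoning must sit inside <think>...</think>, the final
    answer inside <answer>...</answer>. Accuracy: verifiable string
    match against the gold answer, so no neural reward model is needed."""
    r = 0.0
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                  completion, re.DOTALL)
    if m:
        r += 0.1  # format bonus (hypothetical weight)
        if m.group(1).strip() == gold_answer.strip():
            r += 1.0  # accuracy reward for a correct, verifiable answer
    return r

out = "<think>2 + 2 equals 4.</think><answer>4</answer>"
score = rule_based_reward(out, "4")
```

Because the signal comes from deterministic checks rather than a learned preference model, there is no reward model for the policy to exploit, which is what the article means by avoiding reward hacking.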
