Source: Training Compute-Optimal Large Language Models
Summary
The Chinchilla paper from DeepMind demonstrated that most large language models were significantly undertrained — they used too many parameters relative to their training data. By training over 400 models ranging from 70M to over 16B parameters, the authors derived scaling laws showing that for compute-optimal training, model size and training tokens should be scaled in equal proportions. The practical demonstration: Chinchilla (70B parameters, 1.4T tokens) uniformly outperformed Gopher (280B parameters, 300B tokens) and other GPT-3-class models, despite being 4x smaller than Gopher.
Key Claims
- Current large models are undertrained. For a given compute budget, existing models (Gopher at 280B, GPT-3 at 175B) use too many parameters trained on too few tokens. The optimal allocation requires roughly equal scaling of both.
- Model size and data should scale together. Given a 10x increase in compute budget, the optimal model should be ~3.2x larger and trained on ~3.2x more data — both scale roughly as the square root of compute. Prior recommendations (Kaplan et al.) suggested increasing model size ~5.5x while growing data only ~1.8x, which leads to oversized, undertrained models.
- Chinchilla proves the prediction. Using the same compute budget as DeepMind’s Gopher (280B), the authors trained Chinchilla with 70B parameters on 1.4T tokens — 4x smaller model, 4x more data. Chinchilla outperformed Gopher on every evaluation, including 67.5% on MMLU (vs. Gopher’s 60%).
- Smaller optimal models are practically better. A compute-optimal model is smaller, which means faster inference, lower serving costs, easier deployment, and lower fine-tuning costs. The benefits extend well beyond training efficiency.
- Three independent estimation approaches converge. The authors estimated optimal model size for a given compute budget three different ways: fixing model sizes and varying the number of training tokens, constructing IsoFLOP profiles, and fitting a parametric loss function L(N, D). All three predicted that current models should be substantially smaller and trained on far more tokens.
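The equal-scaling claim above can be turned into a back-of-the-envelope calculator. The sketch below is a minimal illustration, not code from the paper: it assumes the standard approximation that training FLOPs C ≈ 6·N·D (N = parameters, D = tokens) and the roughly 20-tokens-per-parameter ratio implied by Chinchilla's 70B/1.4T configuration; the function name and default are ours.

```python
def optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Compute-optimal (params, tokens) for a training budget, assuming
    C = 6 * N * D and a fixed tokens-per-parameter ratio D = r * N.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)),
    so both N and D grow as the square root of compute -- equal scaling.
    """
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# Gopher/Chinchilla-scale budget: 6 * 70e9 * 1.4e12 = 5.88e23 FLOPs
n, d = optimal_allocation(5.88e23)
print(f"params = {n / 1e9:.0f}B, tokens = {d / 1e12:.2f}T")  # params = 70B, tokens = 1.40T
```

Plugging in Gopher's budget recovers Chinchilla's shape (70B parameters, 1.4T tokens) rather than Gopher's (280B, 300B), which is exactly the paper's point: at a fixed budget, the optimal model is much smaller and trained much longer.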
Relevance and Implications
The Chinchilla scaling laws fundamentally changed how the industry trains large language models. After this paper, the focus shifted from "make models bigger" to "train models longer on more data" — a more efficient path to better performance. The scaling-laws framework gives practitioners a way to predict the optimal model size for any compute budget, preventing waste on oversized, undertrained models. The paper also indirectly motivated work on data quality and curation: scaling data 4x requires either finding 4x more text or being smarter about data selection.