Scaling Laws

Scaling laws are empirical relationships that predict how language model performance changes as a function of model size (parameters), training data (tokens), and compute (FLOPs). The central finding: performance improves predictably as these factors scale, following power-law relationships. This means that given a compute budget, you can estimate the optimal model size and training duration before starting — and the answer often contradicts intuition.

The Core Insight

The Chinchilla paper (Hoffmann et al., 2022) established the most practically important scaling law: for compute-optimal training, model size and training tokens should be scaled roughly equally. Since each factor then grows as the square root of compute, a 10x increase in compute budget should produce a model ~3.2x larger trained on ~3.2x more data.

This overturned prior practice. Earlier scaling analysis (Kaplan et al., 2020) suggested increasing model size much faster than data, leading to models like GPT-3 (175B parameters, 300B tokens) that were dramatically undertrained by Chinchilla standards. Chinchilla (70B parameters, 1.4T tokens) used the same compute as the larger Gopher (280B parameters, 300B tokens) but uniformly outperformed it: 4x smaller, trained on over 4x more data.
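The allocation above can be sketched in a few lines. This is a rough illustration, not the paper's estimation procedure: it assumes the standard training-compute approximation C ≈ 6·N·D and the common Chinchilla rule of thumb of roughly 20 training tokens per parameter, and the function name is ours.

```python
import math

def chinchilla_optimal(compute_flops):
    """Estimate compute-optimal parameters N and tokens D for a FLOP budget.

    Assumes C ~= 6 * N * D (standard approximation) and the Chinchilla
    rule of thumb D ~= 20 * N; both are rough, not exact laws.
    """
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N^2
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's approximate budget (~5.76e23 FLOPs) recovers its actual shape:
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.1f}T")  # → params ~ 69B, tokens ~ 1.4T

# Equal scaling means each factor grows as sqrt(compute):
n10, _ = chinchilla_optimal(10 * 5.76e23)
print(f"10x compute -> {n10/n:.1f}x params")  # → 10x compute -> 3.2x params
```

Under these assumptions, a 10x compute budget buys a √10 ≈ 3.2x larger model on 3.2x more tokens, matching the scaling described above.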

Why This Matters

Scaling laws are practically important because they prevent waste:

  • Compute budget planning. Before spending millions of dollars on training, you can estimate the optimal model size. Training a model that’s too large for its data budget (or vice versa) leaves performance on the table.
  • Smaller models serve better. A compute-optimal model is smaller than an undertrained one for the same performance level. Smaller models are faster at inference, cheaper to serve, easier to fine-tune, and simpler to deploy.
  • Data becomes the bottleneck. Scaling data in proportion to parameters means each generation of larger models demands far more training tokens, while high-quality text is finite. This has driven investment in data curation, synthetic data generation, and multimodal training (using images, video, and code in addition to text).
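The serving advantage can be made concrete with the common approximation that a transformer forward pass costs about 2 FLOPs per parameter per generated token; the model sizes below are just the GPT-3 and Chinchilla figures cited above, used for illustration.

```python
def inference_flops_per_token(n_params):
    # Common approximation: a forward pass costs ~2 FLOPs per parameter per token.
    return 2 * n_params

# A 70B compute-optimal model vs. a 175B undertrained one of comparable quality:
ratio = inference_flops_per_token(175e9) / inference_flops_per_token(70e9)
print(f"{ratio:.1f}x fewer FLOPs per token")  # → 2.5x fewer FLOPs per token
```

Because inference cost scales with parameter count, that 2.5x gap applies to every token served over the model's lifetime, which often dwarfs the one-time training cost.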

Power Laws in Practice

Performance (measured as cross-entropy loss) follows a smooth power-law relationship with scale:

$$L(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where $N$ is parameters, $D$ is training tokens, and $E$ is an irreducible error term. This relationship holds across many orders of magnitude, making extrapolation reliable. Three independent estimation approaches in the Chinchilla paper converged on the same prediction.
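To make the formula concrete, the sketch below plugs in the approximate fitted constants reported in the Chinchilla paper (A ≈ 406.4, B ≈ 410.7, E ≈ 1.69, α ≈ 0.34, β ≈ 0.28). Treat these as illustrative; published fits vary slightly.

```python
def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    """Predicted training loss L(N, D) = A/N^alpha + B/D^beta + E.

    Constants are approximate fitted values from the Chinchilla paper
    (Hoffmann et al., 2022); treat them as illustrative, not exact.
    """
    return A / n_params**alpha + B / n_tokens**beta + E

# Two allocations of roughly the same compute budget:
chinchilla_like = chinchilla_loss(70e9, 1.4e12)   # 70B params, 1.4T tokens
gopher_like = chinchilla_loss(280e9, 300e9)       # 280B params, 300B tokens
print(chinchilla_like < gopher_like)  # → True
```

The model-size term A/N^α and the data term B/D^β trade off against each other; the fit predicts a lower loss for the balanced Chinchilla-style allocation than for the parameter-heavy one, consistent with the empirical result above.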

Beyond Training Loss

Scaling laws for training loss are well-established, but the relationship between loss and downstream capabilities is less predictable:

  • Emergent capabilities. Some abilities (like in-context learning and Chain-of-Thought Prompting) appear suddenly at certain scales rather than improving gradually. A model might show no capability at one scale and strong capability at 2x the scale.
  • Benchmark performance is noisy. Individual benchmarks can be sensitive to prompt format, evaluation methodology, and training data overlap. Scaling laws predict average loss much better than specific benchmark scores.
  • Post-training can shift the curve. RLHF and other alignment techniques can dramatically change a model’s effective capabilities at a given scale — the InstructGPT paper showed a 1.3B aligned model outperforming a 175B unaligned one on preference metrics.

Relevance to Atopia Labs Verticals

Understanding scaling laws is essential for any team evaluating or deploying LLMs:

  • Cost estimation. Scaling laws let you estimate whether a task requires a frontier model or whether a smaller model trained on more task-relevant data would suffice.
  • Build vs. buy decisions. Understanding compute-optimal training helps evaluate whether fine-tuning a smaller open model or using a larger API model is more cost-effective for a given use case.
  • Future planning. Because scaling is predictable, capability improvements at a given compute budget can be roughly forecast.

Sources