Chinchilla: Why Compute-Optimal LLMs Beat Bigger Ones

DeepMind's Chinchilla shows model size and training tokens should scale equally. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.

Quick answer

For a fixed compute budget, you should scale model parameters and training tokens equally: double the model, double the data. DeepMind tested this by training Chinchilla, a 70B-parameter model, on roughly 1.4 trillion tokens (about 4x Gopher’s data) using the same compute as the 280B Gopher. Chinchilla uniformly beats Gopher across a large suite of downstream tasks and reaches a state-of-the-art 67.5% average on MMLU, more than 7 points above Gopher. The headline lesson: most 2021-era frontier models were not too small; they were badly undertrained.

The scaling-law mistake everyone was making

The previous generation of scaling work (notably the 2020 Kaplan et al. laws that guided GPT-3) concluded that when you grow your compute budget, you should pour most of it into more parameters and only modestly increase data. That advice produced a race to ever-bigger models — Gopher at 280B, GPT-3 at 175B, MT-NLG at 530B — trained on roughly 300 billion tokens regardless of size.

Chinchilla’s authors re-ran the experiment more carefully: over 400 models from 70M to 16B+ parameters, trained on 5B to 500B tokens, with the compute-optimal frontier estimated by three independent methods. All three agreed with each other and contradicted the earlier conclusion. The optimal parameter count and the optimal token count each grow as roughly the square root of compute, so they should grow in lockstep. By that math, a 530B model trained “compute-optimally” would need on the order of 10 trillion tokens, far more than anyone was feeding these giants.
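The 10-trillion-token figure can be sanity-checked with a short sketch. It rests on two assumptions the article does not state outright: the common approximation that training costs about 6 FLOPs per parameter per token, and a tokens-per-parameter ratio held at Chinchilla's own 1.4T / 70B = 20. The function and constant names are illustrative, not taken from the paper.

```python
import math

# Assumption: training cost C is approximately 6 * N * D FLOPs
# (N = parameters, D = training tokens). This is a common rule of thumb.
FLOPS_PER_PARAM_TOKEN = 6.0

# Chinchilla's own operating point, as quoted in the article.
CHINCHILLA_PARAMS = 70e9
CHINCHILLA_TOKENS = 1.4e12

# Assumption: equal scaling means this ratio stays fixed as models grow.
TOKENS_PER_PARAM = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS  # 20 tokens per parameter


def optimal_tokens(n_params: float) -> float:
    """Training tokens for a model of n_params at Chinchilla's tokens-per-parameter ratio."""
    return TOKENS_PER_PARAM * n_params


def compute_optimal_split(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) using C = 6*N*D and D = ratio*N."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)


# Tokens a 530B model would need under these assumptions, in trillions.
print(f"{optimal_tokens(530e9) / 1e12:.1f} trillion tokens")

# Recover Chinchilla's own split from its estimated budget.
budget = FLOPS_PER_PARAM_TOKEN * CHINCHILLA_PARAMS * CHINCHILLA_TOKENS
params, tokens = compute_optimal_split(budget)
print(f"{params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")
```

Under these assumptions a 530B model calls for about 10.6 trillion tokens, consistent with the order-of-magnitude figure above. The paper's three methods give slightly different fits, so treat the output as a rule of thumb rather than an exact prescription.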

Equal scaling of data and parameters

The practical rule is blunt: for every doubling of model size, double the number of training tokens. The optimum is not a knife-edge — it is a basin — but the 2020 recipe sat well outside it on the under-data side.
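One consequence of the doubling rule is easy to miss, so here it is worked out. This is a sketch that assumes the common approximation of about 6 training FLOPs per parameter per token, which the article does not state; $N$ is the parameter count, $D$ the token count, and $C$ the training compute.

```latex
C \approx 6\,N\,D,
\qquad
N_{\mathrm{opt}} \propto C^{a},
\quad
D_{\mathrm{opt}} \propto C^{b},
\quad
a \approx b \approx \tfrac{1}{2}

(N, D) \to (2N,\, 2D)
\;\Longrightarrow\;
C \to 6\,(2N)(2D) = 4C

C \to 2C
\;\Longrightarrow\;
(N, D) \to \bigl(\sqrt{2}\,N,\ \sqrt{2}\,D\bigr)
```

Read in one direction, doubling both model and data costs roughly four times the compute. Read in the other, a doubled budget buys only about a 1.4x larger model trained on about 1.4x more tokens.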

Chinchilla is the existence proof. Same FLOPs as Gopher, but reallocated: 4x smaller model, 4x more data. The reallocation is the whole contribution. There is no architecture trick, no new objective, no exotic data pipeline — just a corrected budget split, which is exactly why the result was so uncomfortable for labs sitting on huge undertrained checkpoints.
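The "same FLOPs" claim can be checked roughly with a few lines of arithmetic. This sketch assumes the common estimate of about 6 training FLOPs per parameter per token, and it takes Gopher's token count to be the roughly 300 billion cited earlier in the article; both inputs are rounded, so the result is approximate.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Estimated training FLOPs, assuming about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Rounded figures from the article.
gopher_flops = train_flops(280e9, 300e9)       # 280B params, ~300B tokens
chinchilla_flops = train_flops(70e9, 1.4e12)   # 70B params, ~1.4T tokens

budget_ratio = chinchilla_flops / gopher_flops
print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")
print(f"Chinchilla / Gopher: {budget_ratio:.2f}")
```

On these rounded inputs the two estimates land within about 20% of each other, which is as close as "same compute" can be verified at this level of precision. The small gap comes from the rounding: 1.4T tokens is closer to 4.7x than 4x Gopher's roughly 300B.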

Key results

  • Chinchilla = 70B params, ~1.4T training tokens, same compute budget as the 280B Gopher.
  • MMLU: 67.5% average accuracy, a state-of-the-art result and a 7+ point improvement over Gopher.
  • Uniformly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) across a large suite of downstream tasks — despite being 2.5x to 7.5x smaller.
  • Because it is small, Chinchilla costs substantially less to fine-tune and to serve, so the efficiency win compounds in production, not just at training time.

Limits and open questions

Scaling laws are empirical curve fits, not physics. The exponents depend on data quality, tokenizer, architecture, and optimizer schedule — the original “Chinchilla optimal” coefficients have since been re-derived and partly disputed by replication work, and a later erratum adjusted some fits. The laws also optimize pretraining loss for a single training run, which is the wrong objective if you plan to serve a model billions of times: in that regime you deliberately overtrain a smaller model past its compute-optimal point (the LLaMA argument) to cut inference cost.
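The inference-cost argument can be made concrete with a back-of-the-envelope sketch. It assumes about 6 FLOPs per parameter per token for training and about 2 FLOPs per parameter per token for inference, both common approximations that do not come from the article. The 35B model trained on 4T tokens is a hypothetical configuration chosen for illustration, and the sketch compares cost only: it does not claim the two models reach the same quality.

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference FLOPs over a model's deployed lifetime (assumed 6ND + 2N per served token)."""
    training = 6.0 * n_params * train_tokens
    serving = 2.0 * n_params * served_tokens
    return training + serving


def break_even_served_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served-token count beyond which the smaller model has the lower lifetime cost.

    Each argument is a (n_params, train_tokens) pair.
    """
    n_big, d_big = big
    n_small, d_small = small
    if n_small >= n_big:
        raise ValueError("'small' must have fewer parameters than 'big'")
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    if extra_training <= 0:
        return 0.0  # the smaller model is cheaper from the first served token
    saving_per_served_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_served_token


chinchilla_like = (70e9, 1.4e12)   # compute-optimal point from the article
overtrained = (35e9, 4e12)         # hypothetical smaller, longer-trained model

tokens = break_even_served_tokens(chinchilla_like, overtrained)
print(f"Break-even after about {tokens / 1e12:.1f} trillion served tokens")
```

With these illustrative numbers the overtrained model costs more to train but pays that back after a few trillion served tokens, which is the regime the overtraining argument targets. Whether the smaller model is good enough is a separate question that needs a loss or benchmark comparison.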

Chinchilla also says nothing about instruction tuning, alignment, tool use, or long-context behavior — it is a pretraining-economics result, full stop. And the practical ceiling has shifted: when high-quality tokens run scarce, “just double the data” stops being free advice.

FAQ

What is the Chinchilla scaling law in one sentence?

For compute-optimal training, model size and training tokens should scale equally — every doubling of parameters calls for doubling the data.

How big is Chinchilla and how much data did it see?

Chinchilla has 70 billion parameters and was trained on roughly 1.4 trillion tokens, using the same compute budget as the 280B-parameter Gopher.

Did Chinchilla actually beat GPT-3 and Gopher?

Yes. Chinchilla uniformly and significantly outperformed Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and MT-NLG (530B) on a broad downstream suite, and scored 67.5% on MMLU, more than 7 points above Gopher.

Is the Chinchilla recipe still the right way to train LLMs?

Not always. Chinchilla minimizes pretraining loss for a fixed training compute budget, but if a model will be served at huge scale, teams now overtrain smaller models past the Chinchilla point to save on inference, and scarcity of high-quality data complicates the “double the data” rule.

Chinchilla’s real legacy is one corrected assumption: data is a first-class scaling variable, and a well-fed 70B can humble a starved 530B. Read the original on arXiv.