PaLM: Scaling a 540B Dense Language Model with Pathways
PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It set state-of-the-art few-shot results and exceeded average human performance on the aggregate BIG-bench score.
Quick answer
PaLM is a 540-billion-parameter, densely activated, decoder-only Transformer that Google trained on 6,144 TPU v4 chips using its new Pathways system. Its headline claim is that scale keeps paying off: PaLM 540B set state-of-the-art few-shot results across hundreds of language benchmarks, beat the finetuned prior best on a suite of multi-step reasoning tasks, and outperformed average human performance on the aggregate BIG-bench score. The most interesting finding is not the size — it is that a significant number of BIG-bench tasks improved discontinuously as the model scaled, jumping sharply at the largest size rather than improving smoothly.
Training across two TPU v4 Pods with Pathways
The engineering story is Pathways, an ML system built to train one model efficiently across multiple TPU Pods. PaLM used data parallelism across two TPU v4 Pods connected over the data-center network, with a combination of model and data parallelism inside each Pod, for 6,144 chips in total: one of the largest TPU configurations described for a single model at the time. That matters because the hard part of a 540B dense model is no longer the architecture; PaLM is a fairly standard decoder-only Transformer. The contribution is making a model that size train at high hardware utilization without splitting it into a mixture-of-experts. PaLM was, in effect, an argument that careful systems work could keep dense scaling alive while sparse MoE models were absorbing attention.
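To make the two-level layout concrete, here is a toy sketch of data parallelism across Pods with data parallelism inside each Pod. It is an illustration under simplifying assumptions, not Pathways code: plain Python lists stand in for chips, the model is a single scalar weight, model parallelism is left out entirely, and every function name is hypothetical. It shows only that averaging per-Pod gradients reproduces the full-batch gradient when the shards are equal in size.

```python
# Toy illustration only: lists stand in for chips and Pods.
# None of these names come from Pathways or the PaLM codebase.

def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def pod_gradient(w, pod_batch, replicas_per_pod):
    """Within-Pod data parallelism: shard the Pod's batch across replicas,
    compute a gradient per shard, then average inside the Pod."""
    shard = len(pod_batch) // replicas_per_pod
    grads = [
        grad_mse(w, pod_batch[i * shard:(i + 1) * shard])
        for i in range(replicas_per_pod)
    ]
    return sum(grads) / len(grads)

def cross_pod_gradient(w, batch, num_pods=2, replicas_per_pod=4):
    """Cross-Pod data parallelism: each Pod takes an equal slice of the
    batch, and the per-Pod gradients are averaged."""
    per_pod = len(batch) // num_pods
    pod_grads = [
        pod_gradient(w, batch[p * per_pod:(p + 1) * per_pod], replicas_per_pod)
        for p in range(num_pods)
    ]
    return sum(pod_grads) / num_pods

batch = [(float(x), 3.0 * x) for x in range(1, 17)]  # 16 examples of y = 3x
w = 1.0
full = grad_mse(w, batch)               # gradient on the whole batch at once
two_level = cross_pod_gradient(w, batch)  # same gradient, computed in two levels
print(full, two_level)
```

In a real system the cross-Pod average is the expensive step, because it travels over the data-center network rather than the faster links inside a Pod, which is why the split between the two levels matters.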
When scale unlocks reasoning
PaLM’s reasoning results are the part worth dwelling on. Combined with chain-of-thought prompting — asking the model to write out intermediate steps — PaLM 540B matched or beat the finetuned state-of-the-art on multi-step arithmetic and commonsense reasoning benchmarks using only few-shot prompts. That is a meaningful shift: tasks that previously required task-specific finetuning became reachable by prompting a sufficiently large general model. The paper frames this as evidence that some capabilities are emergent with scale rather than smoothly interpolated, which is exactly what the discontinuous BIG-bench curves suggest.
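A few-shot chain-of-thought prompt is simple to assemble. The sketch below is one plausible layout, not the exact template used in the paper: the `Exemplar` class, the `build_cot_prompt` function, the "Q:/A:" framing, and the worked example are all made up for illustration. The point is that each exemplar shows its intermediate steps before the answer, and the final question is left open for the model to continue.

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    question: str
    reasoning: str  # worked intermediate steps, written out in prose
    answer: str

def build_cot_prompt(exemplars, question):
    """Join worked exemplars, then append the new question with an open answer."""
    parts = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Invented exemplar, not taken from any benchmark.
exemplars = [
    Exemplar(
        question="A shelf holds 4 boxes with 6 pens each. 5 pens are removed. "
                 "How many pens remain?",
        reasoning="4 boxes of 6 pens is 24 pens. Removing 5 leaves 19.",
        answer="19",
    )
]
prompt = build_cot_prompt(
    exemplars,
    "A crate holds 3 trays with 8 eggs each. 4 eggs break. How many are intact?",
)
print(prompt)
```

A standard few-shot prompt would use the same structure with the `reasoning` field dropped, which makes the two settings easy to compare on the same questions.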
Key results
- 540B parameters, dense, decoder-only, trained on 6,144 TPU v4 chips via Pathways across two TPU v4 Pods.
- State-of-the-art few-shot performance on hundreds of language understanding and generation benchmarks.
- Breakthrough multi-step reasoning: with chain-of-thought prompting, PaLM 540B outperformed the finetuned state-of-the-art on a suite of reasoning tasks using few-shot prompting alone.
- Beat average human performance on the aggregate BIG-bench benchmark.
- Discontinuous scaling: a significant number of BIG-bench tasks improved sharply only at the largest model size, not gradually.
- Strong multilingual and source-code generation results across a wide range of benchmarks.
- A comprehensive bias and toxicity analysis, plus a study of how training-data memorization grows with model scale.
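One way to make "discontinuous scaling" concrete is to compare the largest model's score with a log-linear extrapolation from the two smaller ones. The sketch below assumes three model sizes (8B, 62B, and 540B parameters) and uses invented scores; the function names and the numbers are illustrative, not figures from the paper.

```python
import math

def log_linear_projection(n_small, s_small, n_mid, s_mid, n_large):
    """Extrapolate the score as a linear function of log(parameter count)."""
    slope = (s_mid - s_small) / (math.log(n_mid) - math.log(n_small))
    return s_mid + slope * (math.log(n_large) - math.log(n_mid))

def discontinuity(scores):
    """scores maps parameter count -> task score for exactly three sizes.
    Returns how far the largest model lands above the projected trend."""
    (n1, s1), (n2, s2), (n3, s3) = sorted(scores.items())
    return s3 - log_linear_projection(n1, s1, n2, s2, n3)

# Invented scores for two hypothetical tasks.
smooth_task = {8e9: 30.0, 62e9: 40.0, 540e9: 51.0}
jumpy_task = {8e9: 25.0, 62e9: 27.0, 540e9: 60.0}

smooth_gap = discontinuity(smooth_task)  # near zero: follows the trend
jumpy_gap = discontinuity(jumpy_task)    # large: far above the trend
print(round(smooth_gap, 2), round(jumpy_gap, 2))
```

A task counts as discontinuous under this measure when the gap is large relative to the trend, which separates "kept improving" from "jumped at the largest size".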
Limits and open questions
PaLM is expensive in the way that defined its era: a dense 540B model activates every parameter for every token, so it is costly to train and costly to serve, and the paper does not pretend scale solves everything. Chain-of-thought reasoning is impressive on benchmarks but is not proof of robust, reliable reasoning; strong few-shot scores do not rule out confabulation or failures on adversarial variants of the same tasks. The memorization analysis is honest about a real cost: larger models memorize more training data, which carries privacy and copyright implications. And the headline “beats average human performance” on BIG-bench is an aggregate over many tasks; it does not mean PaLM beats humans on the hard ones. The deeper open question PaLM raised, and did not answer, is why certain capabilities appear discontinuously. Emergence remains observed rather than explained.
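A back-of-the-envelope calculation shows why serving is costly. The sketch below assumes 2 bytes per parameter (bfloat16 weights) and 32 GiB of high-bandwidth memory per accelerator; both figures are assumptions for illustration rather than numbers from the article, and the estimate covers weights only, ignoring activations and attention caches.

```python
import math

PARAMS = 540 * 10**9              # parameter count stated in the article
BYTES_PER_PARAM = 2               # assumption: bfloat16 weights
HBM_BYTES_PER_CHIP = 32 * 2**30   # assumption: 32 GiB of memory per chip

weight_bytes = PARAMS * BYTES_PER_PARAM
min_chips = math.ceil(weight_bytes / HBM_BYTES_PER_CHIP)

print(f"weights: {weight_bytes / 1e12:.2f} TB, "
      f"at least {min_chips} chips just to hold them")
```

Under these assumptions the weights alone come to roughly a terabyte, so even inference has to be sharded across dozens of chips before any request is served.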
FAQ
What is PaLM and how big is it?
PaLM (Pathways Language Model) is a 540-billion-parameter, densely activated, decoder-only Transformer from Google Research. “Dense” means all parameters are used for every token, unlike sparse mixture-of-experts models.
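The dense versus sparse distinction comes down to how many parameters run per token. The sketch below counts weights for a toy two-matrix feed-forward layer and for a hypothetical routed layer with several expert copies; the dimensions, function names, and top-1 routing are illustrative assumptions, and biases and router weights are ignored.

```python
def dense_active_params(d_model, d_ff):
    """A dense feed-forward layer applies both weight matrices to every token."""
    return 2 * d_model * d_ff

def moe_params(d_model, d_ff, num_experts, top_k):
    """A routed layer stores num_experts copies but runs only top_k per token.
    Returns (total stored parameters, parameters active per token)."""
    per_expert = 2 * d_model * d_ff
    return num_experts * per_expert, top_k * per_expert

dense = dense_active_params(d_model=8, d_ff=32)
moe_total, moe_active = moe_params(d_model=8, d_ff=32, num_experts=4, top_k=1)

# Dense: stored == active. Routed: stored is num_experts / top_k times active.
print(dense, moe_total, moe_active)
```

In the dense case the stored and active counts are the same number, which is the property the answer above describes; in the routed case most stored parameters sit idle for any given token.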
How was PaLM trained?
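PaLM was trained on 6,144 TPU v4 chips using Pathways, an ML system that enables efficient training across multiple TPU Pods. PaLM used two TPU v4 Pods connected over the data-center network, with data parallelism between the Pods and a combination of model and data parallelism inside each one.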
Why is PaLM important for reasoning?
With chain-of-thought prompting, PaLM 540B matched or beat the finetuned state-of-the-art on multi-step reasoning and arithmetic tasks using only few-shot prompts. On those benchmarks, prompting a large enough general model did what had previously required task-specific finetuning.
What does “discontinuous improvement with scale” mean in PaLM?
On a significant number of BIG-bench tasks, PaLM’s accuracy jumped sharply only at the largest model size rather than rising smoothly — evidence that some capabilities emerge abruptly with scale.
Does PaLM beat humans?
PaLM 540B outperformed average human performance on the aggregate BIG-bench score. That is an average across many tasks, not a claim that it beats expert humans or wins every individual task.
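The gap between an aggregate win and a per-task win is easy to see with numbers. The scores below are invented for illustration and are not BIG-bench results; they show how a model can have the higher mean while still losing to the human average on several individual tasks.

```python
# Invented per-task scores (0 to 100); not real benchmark data.
model_scores = [95, 90, 85, 40, 35, 30]
human_scores = [60, 55, 50, 45, 50, 55]

model_mean = sum(model_scores) / len(model_scores)
human_mean = sum(human_scores) / len(human_scores)

# Count the tasks where the human average is still ahead of the model.
tasks_lost = sum(m < h for m, h in zip(model_scores, human_scores))

print(f"model mean {model_mean:.1f}, human mean {human_mean:.1f}, "
      f"model behind on {tasks_lost} of {len(model_scores)} tasks")
```

Here large margins on the easy tasks carry the average, which is exactly the situation in which an aggregate headline says little about the hard tasks.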
PaLM’s lasting lesson is that a frontier model became a distributed-systems project as much as a neural-network design — read the full paper at arXiv:2204.02311.