Switch Transformer: One Expert Per Token, Up to a Trillion Parameters

Quick answer

Switch Transformer is the paper that made sparse Mixture-of-Experts (MoE) simple enough to actually train at scale: instead of sending each token to several experts, it routes every token to exactly one expert. Google Research used this to reach up to 7x faster pretraining than T5 at the same compute budget on the same data, and to train sparse models up to 1.6 trillion parameters — while keeping the per-token FLOPs constant, because only one expert fires per token.

Routing each token to one expert

A standard Transformer reuses the same feed-forward weights for every token. MoE replaces that single feed-forward layer with many parallel “expert” feed-forward networks plus a router that picks which experts handle each token. The pre-2021 assumption — going back to Shazeer’s 2017 MoE — was that you needed top-k routing with k ≥ 2, sending each token to at least two experts so the router would get a useful gradient and the choice would not collapse.

Switch Transformer’s central bet is that k = 1 is enough. The router computes a softmax over experts and sends the token to the single highest-scoring one, scaling that expert’s output by the router probability so gradients still flow to the router. This “Switch layer” is the whole trick, and relative to top-2 routing it pays off three ways at once: the router does less work, each expert’s buffer and the data shuffled between devices can be roughly halved, and the implementation gets dramatically simpler because each token has one destination instead of several.
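The routing rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names (`softmax`, `switch_route`, `switch_layer`) and the toy "experts" are made up for this example, and real systems batch this on accelerators.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Top-1 routing: one (expert_index, gate_probability) pair per token."""
    decisions = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        decisions.append((expert, probs[expert]))
    return decisions

def switch_layer(tokens, router_logits, experts):
    """Run each token through its single chosen expert, scaled by the gate."""
    outputs = []
    for x, (expert, gate) in zip(tokens, switch_route(router_logits)):
        y = experts[expert](x)
        # Scaling by the gate probability is what lets gradients reach the router.
        outputs.append([gate * v for v in y])
    return outputs

# Toy usage: two hypothetical experts that are just simple functions.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0 doubles its input
    lambda x: [-1.0 * v for v in x],  # expert 1 negates its input
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[2.0, 0.0], [0.0, 3.0]]  # assumed router scores per token

decisions = switch_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```

Here the first token goes to expert 0 and the second to expert 1, and each output is the chosen expert's result multiplied by that expert's softmax probability.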

The tricks that made it stable

Sparse models had a reputation for being fragile, and Switch Transformer’s real contribution is the engineering that fixed that:

Selective precision. The instability came largely from the router’s exponentials in bfloat16. The fix is to cast only the router computation to float32 locally, keeping everything else in bfloat16. This let them train large sparse models in low precision for the first time without the communication cost of full float32.
Capacity factor and a load-balancing loss. Each expert has a fixed buffer; tokens beyond it are “dropped” (passed through via the residual). A capacity factor tunes that buffer, and an auxiliary load-balancing loss pushes the router to spread tokens evenly instead of overloading a few favorite experts.
Smaller initialization and expert dropout. Scaling down the weight init and raising dropout inside the experts during fine-tuning kept the much larger models from diverging or overfitting small downstream tasks.
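The capacity and load-balancing ideas above can be made concrete with a small sketch. It assumes the usual formulation: expert capacity is (tokens per batch / number of experts) × capacity factor, and the auxiliary loss is alpha × N × Σ f_i × P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. Rounding the capacity up, filling buffers in token order, the default `alpha`, and all function names are assumptions made for illustration.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Per-expert buffer size; rounding up is an assumption of this sketch."""
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """Fill expert buffers in token order (an assumed policy).

    Returns (kept, dropped) token indices. Dropped tokens skip the expert
    and continue through the residual connection.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return kept, dropped

def load_balancing_loss(router_probs, assignments, alpha=0.01):
    """alpha * N * sum_i f_i * P_i; equals alpha under perfectly uniform routing."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens whose top-1 expert is i
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy usage: 8 tokens, 4 experts, with expert 0 oversubscribed.
assignments = [0, 0, 0, 1, 1, 2, 3, 0]
capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.0)
kept, dropped = dispatch_with_capacity(assignments, 4, capacity)

uniform = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3])
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0])
```

With a capacity factor of 1.0 each expert holds two tokens, so the third and fourth tokens routed to expert 0 are dropped. Raising the factor to 1.25 gives each expert a third slot at the cost of more padding and communication. The collapsed router scores a higher loss than the uniform one, which is the pressure that spreads tokens out.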

Key results

Up to 7x pretraining speedup over T5-Base and T5-Large at identical compute, measured as wall-clock to reach the same quality.
1.6 trillion parameters. The Switch-C model scales sparse parameters to 1.6T while keeping FLOPs per token far below T5-XXL’s, because each token still passes through only one expert per Switch layer.
4x speedup over T5-XXL. A trillion-parameter Switch model pretrained on the Colossal Clean Crawled Corpus (C4) reaches T5-XXL quality about 4x faster.
Gains in all 101 languages. Against mT5-Base on multilingual pretraining, the Switch version improves on every one of the 101 languages — not just the high-resource ones.
Distillation back to dense. A large sparse model can be distilled into a small dense one that keeps around 30% of the sparse model’s quality gain over the dense baseline while shedding the expert parameters, which is useful when serving many experts is impractical.

Why it matters now

Switch Transformer is, in a direct line, the predecessor to today’s MoE LLMs. The “route each token to a small number of experts, balance the load, scale parameters far past your FLOP budget” recipe behind Mixtral, DeepSeek-V3, and Qwen’s MoE models is Switch’s idea, refined. Most modern systems actually walked back to routing each token to two or more experts for quality (Mixtral, for example, uses top-2), so Switch’s strongest claim, that k = 1 is sufficient, is the part the field partly rejected. But the framing that won was Switch’s: sparsity as the practical way to grow capacity without growing per-token compute. If you want to understand why frontier labs ship trillion-parameter MoE models that cost about the same compute per token as far smaller dense ones, this is the paper that made the case credible.

Limits and open questions

The trillion-parameter headline is a parameter count, not a capability claim: those models reach a much smaller dense model’s quality faster, but they do not unlock new abilities, and the FLOPs-matched comparison is the honest one. Sparse MoE also trades compute for memory and bandwidth: you must store and shard every expert, so the 1.6T model needs a large pod of accelerators to hold weights even though each token touches little of it. Token dropping at capacity limits means some tokens get no expert, and fine-tuning these models on small datasets was visibly prone to overfitting and instability. And the k = 1 choice that names the paper is the one later work most often overrode, which suggests that single-expert routing was simple and fast but not always the quality-optimal point.
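The compute-for-memory trade can be checked with back-of-the-envelope arithmetic. The sketch below counts the weights one top-1 MoE layer has to store against the weights a single token actually touches. The layer sizes are hypothetical, not taken from any Switch model, and biases and attention are ignored.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff

def moe_layer_stats(d_model, d_ff, num_experts):
    """Return (stored_params, active_params_per_token) for one top-1 MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # one router weight per (feature, expert) pair
    stored = num_experts * per_expert + router
    active = per_expert + router    # one expert plus the router fires per token
    return stored, active

# Hypothetical layer sizes, chosen only for illustration.
d_model, d_ff = 1024, 4096
for num_experts in (1, 8, 64, 512):
    stored, active = moe_layer_stats(d_model, d_ff, num_experts)
    print(f"{num_experts:4d} experts: stored={stored:,}  active/token={active:,}")
```

Stored weights grow linearly with the number of experts, while the weights touched per token grow only by the small router term. That is why per-token compute stays nearly flat as memory and sharding requirements climb.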

FAQ

How is Switch Transformer different from a normal Mixture-of-Experts?

Earlier MoE layers routed each token to at least two experts (top-k, k ≥ 2). Switch Transformer routes each token to exactly one expert (k = 1), which cuts routing and communication cost and makes the layer much simpler to implement and train.

How many parameters does Switch Transformer have?

The largest model, Switch-C, has 1.6 trillion parameters. Because only one expert activates per token in each Switch layer, the compute per token stays far below that of the dense T5-XXL despite the enormous parameter count.

How much faster is Switch Transformer than T5?

Up to 7x faster pretraining than T5-Base and T5-Large at the same compute, and roughly 4x faster than T5-XXL for the trillion-parameter version to reach comparable quality.

Did Switch Transformer influence models like Mixtral and DeepSeek?

Yes. Switch Transformer popularized sparse MoE for language models, along with the load-balancing and routing machinery that today’s MoE LLMs (Mixtral, DeepSeek-V3, Qwen MoE) build on, though most of them route each token to two or more experts rather than keeping Switch’s single-expert choice.

One line: route each token to one expert, balance the load, and you can scale parameters to a trillion without paying for them in per-token compute. Read the original paper on arXiv.