Attention Is All You Need: The Transformer Architecture Explained

The 2017 Transformer dropped recurrence and convolution for pure attention, hit 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR, and trained in 3.5 days on 8 GPUs. Nearly every modern LLM inherits it.

Quick answer

The Transformer is a sequence model built entirely from attention — no recurrence, no convolution. It scored 28.4 BLEU on WMT 2014 English-German (beating the prior best, including ensembles, by more than 2 BLEU) and set a new single-model record of 41.8 BLEU on English-French, trained in 3.5 days on eight GPUs. The headline result is not the BLEU score; it is that removing recurrence made training fully parallel, and that the same architecture later scaled into GPT, BERT, and almost every large model since.

Why recurrence was the bottleneck

Before 2017, the strongest translation models were recurrent (LSTM/GRU) encoder-decoders, often with an attention bridge between them. The problem was structural: an RNN computes step t only after step t-1, so a 50-token sentence forces 50 sequential operations that a GPU cannot parallelize. Long-range dependencies also had to survive being squeezed through a chain of hidden states, where early signal decays. Convolutional sequence models parallelized better but still needed many stacked layers to connect distant positions. The paper’s bet was that attention alone — already used as a helper — could carry the whole load and erase the sequential dependency.
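The sequential dependency can be seen in a toy recurrent loop. This is a minimal sketch assuming NumPy and a plain tanh RNN; the function name, weight shapes, and sizes are illustrative and are not taken from the paper or from any specific translation system.

```python
import numpy as np

def rnn_forward(x, W_h, W_x):
    """Toy tanh RNN over x of shape (T, d_in); returns hidden states (T, d_h)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(x.shape[0]):
        # h at step t needs h from step t-1, so these iterations cannot run in parallel.
        h = np.tanh(W_h @ h + W_x @ x[t])
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 50, 8, 16                      # a 50-token sentence (illustrative sizes)
x = rng.normal(size=(T, d_in))
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1
states = rnn_forward(x, W_h, W_x)             # 50 dependent steps, one per token
```

Each matrix product inside the loop can use the GPU, but the loop itself is a chain: the work across the 50 positions cannot be batched into a single operation.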

How self-attention works

Self-attention lets every token look directly at every other token in one step. Each token emits a query, a key, and a value; the query is dotted against all keys, the scores are scaled by 1/√d_k (where d_k is the key dimension) and softmaxed into weights, and those weights form a weighted sum of the values. The path length between any two tokens is therefore constant rather than linear in distance, which is exactly what the RNN could not offer.
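The query-key-value computation described above can be sketched in a few lines. This is an illustrative NumPy version for a single sequence with no masking or batching; the projection matrices W_q, W_k, W_v and the sizes are random placeholders chosen for the example, not values from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns the outputs and the (n, n) weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # every query against every key
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8                    # illustrative sizes
X = rng.normal(size=(n, d_model))             # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

All n tokens are handled by the same two matrix products, so no position waits on another. Row i of `weights` shows how much token i draws from each other token.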

Two design choices make it work in practice. Multi-head attention runs several attention computations in parallel on lower-dimensional projections, so the model can track different relations (syntactic, positional, coreference) at once. Positional encodings, fixed sinusoids added to the embeddings, re-inject word order, since self-attention by itself is order-blind: permuting the input tokens simply permutes the outputs (permutation equivariance). The full model is a 6-layer encoder and 6-layer decoder, each layer pairing attention with a position-wise feed-forward network, residual connections, and layer normalization.
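Both design choices can be sketched together. The encoding function follows the paper's sinusoid formula; the multi-head function is a simplified NumPy version with no masking, dropout, or biases. The sizes, the random weights, and the fixed permutation are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    assert d_model % 2 == 0, "this sketch assumes an even model width"
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). Splits the projections into n_heads lower-dimensional heads."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):                              # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    scores -= scores.max(axis=-1, keepdims=True)            # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                      # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 64, 8                 # illustrative sizes
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
X = rng.normal(size=(n, d_model))
perm = np.array([3, 0, 5, 1, 4, 2])            # an arbitrary reordering of the tokens

# Without positions, shuffling the tokens only shuffles the outputs.
plain = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
shuffled = multi_head_self_attention(X[perm], W_q, W_k, W_v, W_o, n_heads)
order_blind = np.allclose(shuffled, plain[perm])

# With positions added, the same shuffle produces a different input.
pe = sinusoidal_positional_encoding(n, d_model)
with_pos = multi_head_self_attention(X + pe, W_q, W_k, W_v, W_o, n_heads)
```

The `order_blind` check is the reason positional encodings are needed: without them the layer cannot tell "dog bites man" from "man bites dog".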

Key results

  • WMT 2014 English-German: 28.4 BLEU, over 2 BLEU above the previous best, including ensembles.
  • WMT 2014 English-French: 41.8 BLEU, a new single-model state of the art.
  • Training cost: the big model trained in 3.5 days on eight P100 GPUs — a fraction of the compute the competing models reported.
  • Generalization: applied to English constituency parsing, it stayed competitive with task-specific systems in both large- and limited-data regimes, showing the architecture was not translation-specific.

The efficiency number is the underrated one. Matching or beating heavily engineered ensembles at a small fraction of the training cost is what made the architecture worth scaling up rather than just publishing.

Limits and open questions

Self-attention is O(n²) in sequence length: every token attends to every other, so cost grows quadratically and long documents get expensive fast. A large follow-on literature on sparse and linear attention tries to reduce this cost, while FlashAttention-style kernels keep attention exact and cut its memory traffic instead; none of these has fully replaced the original. The paper also makes no claims about reasoning, factuality, grounding, or alignment. Those are properties of scale, data, and training objectives layered on top, not of the architecture itself. And the sinusoidal positional scheme was quickly superseded (learned, then rotary, embeddings), a reminder that the specific recipe aged faster than the core idea.
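A quick count makes the quadratic growth concrete. The helper below is a hypothetical name introduced for this example; it counts the entries of the n × n score matrix that a straightforward implementation builds per head per layer, and the sequence lengths are arbitrary.

```python
def attention_score_entries(n_tokens, n_heads=1):
    """Entries in the n x n attention score matrices of one layer."""
    return n_heads * n_tokens * n_tokens

for n in (512, 2048, 8192):
    entries = attention_score_entries(n)
    print(f"n={n:>5}: {entries:>12,} scores per head per layer")
# n=  512:      262,144 scores per head per layer
# n= 2048:    4,194,304 scores per head per layer
# n= 8192:   67,108,864 scores per head per layer
```

Quadrupling the sequence length multiplies the number of scores by 16, which is why long contexts dominate the cost.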

FAQ

What is the main contribution of Attention Is All You Need?

It introduced the Transformer, the first competitive sequence-transduction model with no recurrence or convolution — attention does all the sequence mixing. That made training parallel and the architecture easy to scale.

Why is the Transformer faster to train than an RNN?

An RNN must process tokens in order, each step dependent on the last. During training, the Transformer computes all token interactions within a layer in parallel, so a full sequence runs as a few large matrix multiplications instead of n sequential steps. That is a large part of why it trained in 3.5 days on 8 GPUs.

Does Attention Is All You Need cover large language models?

No. The 2017 paper is about machine translation and parsing. GPT, BERT, and later LLMs reuse the Transformer block but add scale, pretraining objectives, and far larger data — none of which this paper studies.

What is the biggest weakness of the original Transformer?

Quadratic attention cost in sequence length, which makes long contexts expensive. Most efficiency research since 2017 targets this, but the standard attention mechanism remains the baseline everyone compares against.

The Transformer’s real claim wasn’t a BLEU score — it was proving attention alone could replace recurrence, and that bet now underpins the field. Read it at the source on arXiv.