T5 Explained: One Text-to-Text Interface for Every NLP Task

Quick answer

T5 (Text-to-Text Transfer Transformer) treats classification, translation, summarization, and QA as the same problem: read a text string, write a text string. Google Research used that single interface to run a controlled study of pretraining objectives, architectures, datasets, and scale on dozens of language tasks, then pushed the recipe with an 11-billion-parameter model that reached state-of-the-art on GLUE, SuperGLUE, SQuAD, and several summarization benchmarks. The lasting contribution is not one magic trick but the apples-to-apples comparison the unified format made possible.

Every task as text-to-text

The trick that holds everything together is dead simple: prefix the input with a short task tag and always decode free text. Inputs such as “translate English to German: ...”, “summarize: ...”, or “cola sentence: ...” all go into the same encoder-decoder, and even regression targets like the STS-B similarity score (1 to 5) are emitted as a text string such as “3.8”. No task-specific heads, no separate loss functions, no per-task output layers.
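As a minimal sketch of that contract, assuming plain Python and no model at all, the helpers below turn task examples into (input, target) string pairs. The function names are hypothetical. The "stsb" prefix and the "acceptable"/"unacceptable" label strings are assumptions about the exact wording, and rounding the STS-B score to the nearest 0.2 follows the paper's description of how scores become strings.

```python
from typing import Tuple


def format_translation(text: str) -> str:
    """Build the model input for English-to-German translation."""
    return f"translate English to German: {text}"


def format_cola(sentence: str, label: int) -> Tuple[str, str]:
    """Classification: the class label is written out as a word."""
    # Assumed label strings; label 1 is taken to mean "grammatically acceptable".
    target = "acceptable" if label == 1 else "unacceptable"
    return f"cola sentence: {sentence}", target


def format_stsb(sentence1: str, sentence2: str, score: float) -> Tuple[str, str]:
    """Regression: round the score to the nearest 0.2 and emit it as text."""
    rounded = round(score * 5) / 5
    source = f"stsb sentence1: {sentence1} sentence2: {sentence2}"
    return source, f"{rounded:.1f}"


if __name__ == "__main__":
    print(format_translation("That is good."))
    print(format_cola("The course is jumping well.", 0))
    print(format_stsb("A man sings.", "A man is singing.", 2.57))
```

Every helper returns ordinary strings, so one training loop and one loss can serve all three tasks.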

That uniformity is what makes the rest of the paper trustworthy. Because the input/output contract never changes, the authors can swap one variable at a time — objective, architecture, corpus, fine-tuning strategy — and attribute the difference to that variable alone. Most prior transfer-learning papers changed several things at once; T5’s value is that it didn’t.
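The one-variable-at-a-time design can be sketched as a small configuration generator. This is an illustration only: the dictionary keys, the option strings, and the function name are hypothetical and do not come from the T5 codebase.

```python
from typing import Dict, List

# Hypothetical baseline configuration for the sweep.
BASELINE: Dict[str, str] = {
    "architecture": "encoder-decoder",
    "objective": "span-corruption",
    "corpus": "C4",
}

# Hypothetical alternatives to try for each factor.
ALTERNATIVES: Dict[str, List[str]] = {
    "architecture": ["decoder-only", "prefix-lm"],
    "objective": ["language-modeling", "deshuffling"],
    "corpus": ["unfiltered-common-crawl"],
}


def one_factor_variants(
    baseline: Dict[str, str], alternatives: Dict[str, List[str]]
) -> List[Dict[str, str]]:
    """Return configs that each differ from the baseline in exactly one key."""
    variants = []
    for key, values in alternatives.items():
        for value in values:
            variant = dict(baseline)  # copy so the baseline is never mutated
            variant[key] = value
            variants.append(variant)
    return variants


if __name__ == "__main__":
    for config in one_factor_variants(BASELINE, ALTERNATIVES):
        print(config)
```

Because each variant shares every other setting with the baseline, any change in a downstream score can be attributed to the single factor that moved.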

The C4 corpus

To pretrain at scale you need clean data, so the team built the Colossal Clean Crawled Corpus (C4) from Common Crawl. The cleaning is aggressive and opinionated: keep only lines that end in terminal punctuation, throw out lines shorter than a few words and pages with only a handful of sentences, drop any page containing a word from an offensive-word list, deduplicate repeated text, and keep only English. The result is roughly 750GB of text, far larger than the curated corpora common at the time, and the cleaning itself measurably beat training on raw Common Crawl.
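A rough sketch of heuristics in this style follows. It is not the C4 pipeline itself: the thresholds, the function name, and the blocklist contents are illustrative assumptions, the page-length check counts kept lines rather than sentences, and language detection and deduplication are left out to keep the example self-contained.

```python
import re
from typing import Optional, Set

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 5  # illustrative threshold (assumption)
MIN_LINES_PER_PAGE = 3  # illustrative threshold (assumption)


def clean_page(text: str, blocklist: Set[str]) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is dropped."""
    # Drop the entire page if any blocklisted word appears anywhere in it.
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & blocklist:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like full sentences.
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)

    # Pages with too little surviving text are dropped as well.
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)


if __name__ == "__main__":
    page = (
        "Home | About | Contact\n"
        "This is the first full sentence on the page.\n"
        "Short one.\n"
        "Here is a second sentence with enough words.\n"
        "And a third sentence that also passes the filter.\n"
        "click here"
    )
    print(clean_page(page, blocklist={"badword"}))
```

Note how the blocklist check discards a whole page on a single match, which is the behavior that makes the filter blunt.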

What the systematic study found

The sweep is the real paper. A few results that survived all the ablations: the encoder-decoder architecture beat decoder-only and prefix-LM variants at matched parameter counts; a BERT-style denoising (span-corruption) objective beat language-modeling and deshuffling objectives; and replacing each contiguous span of dropped tokens with a single sentinel token shortened sequences and sped up training without hurting quality. Scale helped along every axis the authors tried (more parameters, more data, more steps), which is exactly the kind of boring, monotone signal that justified building the 11B model.
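To make the sentinel idea concrete, here is a minimal sketch of span corruption on a whitespace-tokenized sentence. It is a simplification under stated assumptions: the spans are chosen by hand rather than sampled at random, the function name is hypothetical, and the sentinel strings are placeholder names in the style of the extra-id tokens found in common T5 vocabularies.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each span with one sentinel; the target lists the dropped tokens.

    `spans` must be sorted, non-overlapping, half-open [start, end) ranges.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # placeholder sentinel name (assumption)
        inputs.extend(tokens[cursor:start])  # untouched tokens before the span
        inputs.append(sentinel)  # the whole span collapses to one token
        targets.append(sentinel)
        targets.extend(tokens[start:end])  # the dropped tokens go to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel ends the target
    return inputs, targets


if __name__ == "__main__":
    tokens = "Thank you for inviting me to your party last week .".split()
    inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
    print(" ".join(inputs))
    print(" ".join(targets))
```

The input keeps 10 of the original 11 positions and the target has only 6 tokens, so the decoder predicts a short sequence instead of reconstructing the full sentence.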

Key results

The flagship T5-11B set state-of-the-art results across many benchmarks. On SuperGLUE it reached an average of 88.9, closing most of the gap to the human baseline of 89.8 — a striking number for a 2019 system. It also topped the GLUE leaderboard and posted strong exact-match/F1 on SQuAD, plus state-of-the-art numbers on CNN/Daily Mail summarization. The honest framing: most of the headline jump came from scale and C4 on top of an already-strong encoder-decoder, not from a novel objective — the objective study mostly told you which choices not to waste compute on.

The team released the data (C4), pretrained checkpoints across five sizes (Small through 11B), and code, which is a large part of why T5 became a default baseline rather than a one-off result.

Limits and open questions

The unified text-to-text format is elegant but lossy: forcing structured outputs (spans, scores, labels) through free-text decoding can discard structure a task-specific head would have used, and it makes parsing the output an extra failure mode. C4’s cleaning is a blunt instrument — the offensive-word blocklist also removes legitimate content and bakes in English-web bias, and “cleaner” is defined by heuristics, not by any downstream guarantee. The 11B model is expensive to fine-tune and serve, so the SOTA numbers came at a cost most practitioners couldn’t pay in 2019. Finally, T5 predates instruction tuning and RLHF; its task tags are fixed prompts, not the flexible natural-language instructions later models learned to follow.
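The parsing failure mode can be illustrated with a small sketch that maps decoded text back to a label or a score and returns None when the text does not fit. The function names, the label set, and the default score range are assumptions for illustration, not part of any T5 release.

```python
from typing import Optional, Sequence


def parse_label(decoded: str, valid_labels: Sequence[str]) -> Optional[str]:
    """Return the label if the decoded text matches one exactly, else None."""
    text = decoded.strip().lower()
    return text if text in valid_labels else None


def parse_score(decoded: str, low: float = 1.0, high: float = 5.0) -> Optional[float]:
    """Return the decoded text as a float within [low, high], else None."""
    try:
        value = float(decoded.strip())
    except ValueError:
        return None  # free text that is not a number at all
    return value if low <= value <= high else None


if __name__ == "__main__":
    labels = ["acceptable", "unacceptable"]
    print(parse_label(" Acceptable ", labels))
    print(parse_label("maybe", labels))
    print(parse_score("3.8"))
    print(parse_score("three point eight"))
```

A task-specific head can never emit an out-of-vocabulary class, whereas here every None is a prediction the evaluation has to count as wrong or handle separately.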

FAQ

What does T5 stand for?

T5 stands for Text-to-Text Transfer Transformer — five words starting with T. It is an encoder-decoder model from Google Research that casts every NLP task as converting an input text string to an output text string.

How is T5 different from BERT?

BERT is an encoder-only model that produces representations you attach task-specific heads to; T5 is a full encoder-decoder that generates the answer as text. T5 borrows BERT’s masked-denoising idea for pretraining but unifies all downstream tasks under one generative interface instead of per-task classification heads.

What is the C4 dataset used by T5?

C4 (Colossal Clean Crawled Corpus) is a ~750GB English web corpus filtered from Common Crawl, built specifically to pretrain T5. The cleaning removes boilerplate, short or punctuation-less lines, duplicates, and offensive-word matches, and it measurably outperformed pretraining on raw Common Crawl.

Is T5 still worth using?

For many sequence-to-sequence jobs the architecture is still a solid baseline, and instruction-tuned descendants like Flan-T5 remain practical. But T5 predates RLHF and modern instruction following, so for chat-style or general-purpose use you would reach for a more recent model. T5’s enduring value is the methodology, not the checkpoint.

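T5’s real legacy is a discipline, not a model: change one thing, measure it, and only then spend the compute on scale. The original paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (Raffel et al.), is available on arXiv.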