
BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.


Quick answer

BERT is a deep bidirectional Transformer encoder, pretrained on unlabeled text with masked language modeling, that you fine-tune into a task-specific model by adding a single output layer. In the 2018 paper it set new state of the art on 11 NLP tasks at once: GLUE 80.5% (+7.7 absolute), MultiNLI 86.7% (+4.6), SQuAD v1.1 test F1 93.2 (+1.5), and SQuAD v2.0 test F1 83.1 (+5.1). The lasting contribution is not any single score but the recipe — pretrain once, fine-tune everywhere.

Why left-to-right models were limited

Earlier language models read text in one direction. To represent the word “bank” in “I went to the bank to deposit a check,” a left-to-right model sees only “I went to the” and never the later words that disambiguate it. Some systems (ELMo is the best-known example) concatenated a separately trained left-to-right and right-to-left model, but that is a shallow patch: the two halves never condition on each other inside the same layers. For tasks like question answering, where the answer span depends on words on both sides, this one-directional habit left real signal on the table. BERT’s whole argument is that representations should be deeply bidirectional from the first layer up.
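The difference is easy to see in a toy sketch. The snippet below is only an illustration, not how attention is actually implemented: `visible_context` is a hypothetical helper, the example sentence is made up, and the token is left out of its own context for simplicity.

```python
def visible_context(tokens, position, bidirectional):
    """Return the other tokens a model may use when representing tokens[position]."""
    if bidirectional:
        # Encoder-style attention: positions on both sides are visible.
        return tokens[:position] + tokens[position + 1:]
    # Left-to-right attention: only earlier positions are visible.
    return tokens[:position]


tokens = "I went to the bank to deposit a check".split()
target = tokens.index("bank")

left_only = visible_context(tokens, target, bidirectional=False)
both_sides = visible_context(tokens, target, bidirectional=True)

print(left_only)   # ['I', 'went', 'to', 'the']
print(both_sides)  # ['I', 'went', 'to', 'the', 'to', 'deposit', 'a', 'check']
```

Only the bidirectional view contains "deposit" and "check", the words that settle which sense of "bank" is meant.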

Masked language modeling

You can’t just let a Transformer attend in both directions while predicting the next token — each word would indirectly “see itself.” BERT’s fix is the cleverest part of the paper: randomly mask 15% of input tokens and train the model to recover them from the surrounding context on both sides. Because the target token is hidden, bidirectional attention is now safe. A second objective, next sentence prediction, trains the model to judge whether one sentence actually follows another, intended to help sentence-pair tasks like inference and QA.
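A minimal sketch of the masking step may help here. It follows the corruption rule described in the paper (of the selected positions, 80% become the mask token, 10% become a random token, 10% stay unchanged), but it simplifies in one place: each position is selected independently with probability 0.15, rather than picking a fixed 15% of positions. The names `mask_tokens`, `select_prob`, and the toy vocabulary are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where no prediction is made."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected, so it contributes no loss
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "the man went to the store to buy a gallon of milk".split()
vocab = sorted(set(tokens))  # toy vocabulary; a real one has tens of thousands of entries

corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`; everything else is context.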

Both objectives run on plain unlabeled text — BooksCorpus plus English Wikipedia — so pretraining needs no human annotation. The result is a general-purpose encoder; the downstream task only adds a thin classifier or span predictor on top.
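To see why no annotation is needed, here is a rough sketch of how next-sentence-prediction examples can be built from raw text: sentence order alone supplies the label. The paper uses a 50/50 split between true next sentences and random ones. The helper name `make_nsp_pairs` is hypothetical, and drawing negatives from the same short list is a simplification (the paper samples them from the wider corpus).

```python
import random


def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A in the text.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: B is any sentence other than A or its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


document = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live mostly in the Southern Hemisphere.",
]

for a, b, is_next in make_nsp_pairs(document, random.Random(0)):
    print(is_next, "|", a, "|", b)
```

The sketch assumes at least three distinct sentences, so that a negative candidate always exists.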

Key results

The headline is breadth, not a single win. BERT-large reported GLUE 80.5%, a 7.7-point absolute jump over the prior best, plus MultiNLI 86.7%, SQuAD v1.1 test F1 93.2, and SQuAD v2.0 test F1 83.1: 11 tasks in total, with no per-task architecture surgery. Ablations in the paper matter as much as the scores. Swap the masked objective for a left-to-right one and results get worse across tasks, with the largest drops on MRPC and SQuAD, which is the cleanest evidence that bidirectionality, not just scale, drove the gains.
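For a sense of where the bar stood before BERT, the previous best scores can be backed out from the reported numbers and their absolute gains. This is plain arithmetic on the figures quoted in this article; the implied baselines are derived here rather than quoted.

```python
# (reported BERT score, absolute gain over the prior best), as quoted in this article
reported = {
    "GLUE": (80.5, 7.7),
    "MultiNLI": (86.7, 4.6),
    "SQuAD v1.1 F1": (93.2, 1.5),
    "SQuAD v2.0 F1": (83.1, 5.1),
}

implied_prior = {}
for task, (score, gain) in reported.items():
    # Round to one decimal to avoid floating-point noise in the subtraction.
    implied_prior[task] = round(score - gain, 1)
    print(f"{task}: {implied_prior[task]} -> {score} (+{gain})")

# GLUE: 72.8 -> 80.5 (+7.7)
# MultiNLI: 82.1 -> 86.7 (+4.6)
# SQuAD v1.1 F1: 91.7 -> 93.2 (+1.5)
# SQuAD v2.0 F1: 78.0 -> 83.1 (+5.1)
```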

My honest read: the next-sentence-prediction objective was the weakest link, and later work (notably RoBERTa) showed you could remove it and train longer for better results. The durable idea is masked pretraining plus fine-tuning.

Why it matters now

BERT turned NLP from “design a model per task” into “pretrain a backbone, adapt it cheaply.” That template now underpins search ranking, question answering, biomedical and legal text mining, and a whole family of encoders (RoBERTa, ALBERT, DeBERTa, domain variants). It also made the pretraining objective itself a first-class research question — what should a model learn before it ever sees a label?

Limits and open questions

BERT is an encoder, not a generator: it classifies, ranks, and extracts, but it does not write fluent long-form answers, which is why the field later pivoted to decoder-style LLMs for generation. Fine-tuning can be unstable on small datasets, and pretraining from scratch is compute-heavy even though fine-tuning an existing checkpoint is comparatively cheap. The mask token also creates a pretrain/fine-tune mismatch: [MASK] appears in pretraining but never at fine-tuning time. The paper softens this by replacing a selected token with [MASK] only 80% of the time, swapping in a random token 10% of the time, and leaving it unchanged the other 10%. If you need generation or few-shot prompting, skip BERT; if you need a strong, cheap-to-fine-tune understanding model, it and its descendants are still hard to beat.
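To make "extracts" concrete: for SQuAD-style question answering, the paper scores every token as a possible answer start and a possible answer end, then picks the span with the highest combined score, with the end not before the start. The sketch below shows only that selection step. The scores are made-up numbers standing in for model outputs, and `best_span` and `max_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j], i <= j."""
    best = None
    for i, start in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


tokens = "BERT was introduced by researchers at Google in 2018".split()
# Hypothetical per-token scores for the question "Who introduced BERT?"
start_scores = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 2.5, 0.4, 1.0]
end_scores = [0.0, 0.1, 0.1, 0.0, 0.2, 0.1, 2.0, 0.3, 1.5]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # Google
```

Nothing here generates text: the answer is always a slice of the input, which is exactly the limit described above.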

FAQ

What is BERT in one sentence?

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder pretrained on unlabeled text with masked language modeling, designed to be fine-tuned into state-of-the-art models for understanding tasks like classification, inference, and question answering.

How is BERT different from GPT?

GPT is a left-to-right decoder built to generate text; BERT is a bidirectional encoder built to understand text. BERT sees both left and right context at every layer, which helps classification and span-extraction tasks, but it does not produce free-form text the way GPT does.

Why does BERT use masked language modeling?

Masking lets BERT train bidirectionally without a token leaking its own identity. By hiding 15% of tokens and predicting them from both-side context, the model learns deep representations that condition on the full sentence, not just a prefix.

Is BERT still relevant in 2026?

Yes for understanding tasks. Generative LLMs replaced BERT for chat and writing, but BERT-family encoders remain widely used for retrieval, ranking, and classification because they are cheap to fine-tune and fast to serve.

One line: BERT proved that deep bidirectional pretraining, not bigger task-specific models, was the unlock for language understanding.

Source: arXiv:1810.04805