Vision Transformer (ViT): An Image is Worth 16x16 Words

Quick answer

Vision Transformer (ViT) takes a near-unmodified language Transformer, feeds it a sequence of flattened image patches (16x16 pixels in the configuration the title refers to) instead of words, and shows the result matches or beats state-of-the-art CNNs on image classification, but only after pre-training on a very large dataset. The best model, ViT-H/14 (the "Huge" variant, which uses 14x14 patches), pre-trained on Google’s internal JFT-300M (about 300 million images), reaches 88.55% top-1 on ImageNet, using substantially less pre-training compute than the convolutional baselines it matches. The catch is in that sentence: the headline holds only with hundreds of millions of pre-training images.

Treating an image as a sequence of patches

A Transformer expects a sequence of tokens. ViT’s whole trick is making an image look like one. It splits a 224x224 image into a grid of fixed 16x16 patches (196 patches), flattens each patch, and projects it through a single linear layer into a token embedding. A learnable position embedding is added so the model knows where each patch sat in the grid, and a special learnable [class] token is prepended whose final state feeds the classifier. From there it is an ordinary Transformer encoder — multi-head self-attention and MLP blocks, no convolutions.
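
The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper’s implementation: the function names (`patchify`, `embed`), the random weights, and the hidden size of 768 are assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh*gw, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, w_proj, b_proj, cls_token, pos_embed, patch=16):
    """Build the encoder input: [class] token + linearly projected patches + positions."""
    patches = patchify(image, patch)                       # (N, P*P*C)
    tokens = patches @ w_proj + b_proj                     # (N, D) single linear layer
    tokens = np.concatenate([cls_token, tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed                              # one position vector per token

# Toy parameters; in a real model these are learned. dim=768 is an assumed hidden size.
rng = np.random.default_rng(0)
dim = 768
image = rng.standard_normal((224, 224, 3))
w_proj = rng.standard_normal((16 * 16 * 3, dim)) * 0.02
b_proj = np.zeros(dim)
cls_token = np.zeros((1, dim))
pos_embed = rng.standard_normal((1 + 14 * 14, dim)) * 0.02

tokens = embed(image, w_proj, b_proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 768): 196 patch tokens plus the [class] token
```

Everything after this point is sequence processing; the 2D image only survives as the order of the rows and whatever the position embeddings encode.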

That is the conceptual payload: there is almost no vision-specific machinery left. ViT deliberately strips out the inductive biases CNNs bake in — locality, translation equivariance, a 2D neighborhood structure — keeping only a weak hint of 2D layout through the position embeddings. Everything else the model must learn from data.
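
One way to see how little spatial structure is built in: self-attention without position embeddings cannot tell which patch came from where. The sketch below is a single-head attention layer with random, hypothetical weights (not ViT’s trained parameters); shuffling the input tokens simply shuffles the output rows in the same way.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over an (N, D) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, N): every token scores every other token
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 196, 64  # 196 patch tokens; d=64 is an arbitrary toy width
tokens = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(tokens[perm], w_q, w_k, w_v)

# Attention is permutation-equivariant: the layer has no notion of neighbors.
shuffle_is_invisible = np.allclose(out[perm], out_shuffled)
print(shuffle_is_invisible)
```

A convolution would fail this check, because its output depends on which pixels are adjacent. In ViT, the position embeddings are the only place where layout information enters.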

Why ViT needs so much data

Removing those biases is exactly why ViT is data-hungry. The paper makes the trade-off explicit and quantifies it: when pre-trained only on ImageNet (~1.3M images), ViT underperforms comparable ResNets. Pre-train on the larger ImageNet-21k (about 14M images), and it pulls roughly even. Pre-train on JFT-300M, and ViT overtakes the CNNs. The crossover is the real finding: large-scale pre-training doesn’t just help ViT, it substitutes for the architectural priors CNNs get for free. Below that data threshold a ResNet’s built-in locality is a genuine advantage; above it, that prior becomes a constraint the Transformer is free of.

Key results

ViT-H/14 (JFT-300M pretraining): 88.55% top-1 on ImageNet, 99.50% on CIFAR-10, 94.55% on CIFAR-100, and 77.63% on the 19-task VTAB suite — at or above the best CNNs (BiT-L) of the time.
Compute efficiency: ViT-H/14 took roughly 2.5k TPUv3-core-days to pre-train, against about 9.9k for BiT-L and 12.3k for Noisy Student, the convolutional baselines it matches or beats. In the paper’s scaling study, ViT also reaches a given transfer accuracy with less pre-training compute than comparable ResNets.
The data crossover: with ImageNet-only pre-training ViT trails ResNets; the ranking flips only as the pre-training set grows to ImageNet-21k (about 14M images across 21k classes) and then to JFT-300M.
What attention learns: the paper shows that some attention heads integrate information across the whole image even in the lowest layers, while others stay local, and that the learned position embeddings recover a 2D grid structure. The model learns spatial structure from data rather than having it hardwired.

Limits and open questions

The honest read is that ViT’s 2020 headline is a statement about scale, not a free lunch. The best results depend on JFT-300M, a proprietary Google dataset most teams cannot use, which makes the flagship numbers hard to reproduce externally. In the data regimes most practitioners actually have, a CNN is still the stronger and cheaper choice; strong augmentation and regularization narrow the gap but do not erase it. Plain ViT is also tied to its training resolution: changing the input size changes the number of patches, so the learned position embeddings no longer line up with the grid and have to be interpolated in 2D. The flat (non-hierarchical) patch grid is a poor fit for dense tasks like detection and segmentation. Later designs were built to fix these gaps (DeiT for data efficiency, Swin for hierarchy). ViT proved the architecture works; it did not make it the default for everyone overnight.
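
The position-embedding resize can be sketched as follows. The paper describes 2D interpolation of the pre-trained position embeddings; the specific scheme here (bilinear, with corner patches aligned) and the function name `resize_pos_embed` are assumptions for illustration, not the reference implementation.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize (1 + old_grid**2, D) position embeddings to a new square grid.

    Row 0 is assumed to belong to the [class] token and is carried over unchanged.
    """
    cls_pos, grid_pos = pos_embed[:1], pos_embed[1:]
    d = grid_pos.shape[1]
    grid = grid_pos.reshape(old_grid, old_grid, d)

    # Sample locations in old-grid coordinates, keeping the corner patches aligned.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows first, then along columns.
    fr = frac[:, None, None]
    rows = grid[lo] * (1 - fr) + grid[hi] * fr          # (new_grid, old_grid, D)
    fc = frac[None, :, None]
    out = rows[:, lo] * (1 - fc) + rows[:, hi] * fc     # (new_grid, new_grid, D)
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)

# Hypothetical case: pre-trained at 224x224, fine-tuned at 384x384, patch size 16.
rng = np.random.default_rng(0)
pos_224 = rng.standard_normal((1 + 14 * 14, 8))   # toy width of 8 keeps the demo small
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)
print(pos_384.shape)  # (577, 8): 24 * 24 patch positions plus the [class] row
```

The interpolated embeddings are only a starting point; the paper fine-tunes at the new resolution afterwards.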

Why it still matters

ViT’s lasting contribution is conceptual: once an image is a sequence of tokens, the entire language-model toolkit — scaling laws, self-supervised pretraining, masked prediction, multimodal alignment — transfers to vision with little friction. The vision-language models and image foundation models that followed largely assume a patch-token backbone. ViT is the paper that made that assumption safe.

FAQ

What does “16x16 words” mean in the Vision Transformer paper?

It means ViT chops an image into 16x16-pixel patches and treats each patch as one token — the visual equivalent of a word. A 224x224 image becomes a 196-token sequence fed to a standard Transformer.
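
The token count is plain arithmetic, and it grows quickly as patches shrink. A small helper (hypothetical name `num_tokens`, assuming square images and square patches) makes the relationship concrete:

```python
def num_tokens(image_size, patch_size, class_token=True):
    """Sequence length for a square image cut into square, non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size
    return grid * grid + (1 if class_token else 0)

print(num_tokens(224, 16, class_token=False))  # 196 patch tokens
print(num_tokens(224, 16))                     # 197 with the [class] token
print(num_tokens(224, 14))                     # 257 for a /14 model at the same resolution
```

Because self-attention compares every token with every other token, its cost grows with the square of this number, which is why smaller patches are more expensive.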

Why does Vision Transformer (ViT) need so much pretraining data?

Because ViT removes the locality and translation-equivariance biases that CNNs have built in. With little data (ImageNet alone) it underperforms ResNets; only at JFT-300M scale do those missing priors get learned from data and ViT pulls ahead.

How does Vision Transformer (ViT) compare to CNNs on ImageNet?

ViT-H/14 pre-trained on JFT-300M reaches 88.55% top-1 on ImageNet, matching or beating the best CNNs of its era (BiT-L) while using substantially less pre-training compute.

Is ViT better than a CNN for my own dataset?

Usually not, if your dataset is small. ViT’s advantage shows up at very large pre-training scale; in typical limited-data settings a CNN — or a data-efficient ViT variant like DeiT — is the safer choice.

ViT’s one-line lesson: with enough data, a plain Transformer on raw image patches beats hand-built convolutional priors — scale buys what architecture used to. Read the original: https://arxiv.org/abs/2010.11929