Language Models · Google Research
BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.
Language Models · Google DeepMind
DeepMind's Chinchilla shows model size and training tokens should scale in equal proportion: double the parameters, double the tokens. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.
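The 70B and ~1.4T figures above work out to roughly 20 training tokens per parameter. A minimal sketch of what scaling both together looks like, assuming the common approximation that training compute is about 6 x parameters x tokens FLOPs. That approximation and the function name `compute_optimal_split` are assumptions for illustration, not taken from the summary:

```python
import math

# Ratio implied by the 70B-parameter / ~1.4T-token figures (about 20).
TOKENS_PER_PARAM = 1.4e12 / 70e9

def compute_optimal_split(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens), assuming C ~= 6 * N * D."""
    # With D = r * N the budget is C = 6 * r * N**2, so N grows like sqrt(C).
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

budget = 6 * 70e9 * 1.4e12                   # approximate budget of the 70B run
n, d = compute_optimal_split(budget)         # recovers ~70e9 params, ~1.4e12 tokens
n4, d4 = compute_optimal_split(4 * budget)   # 4x compute: 2x params and 2x tokens
```

Under these assumptions, quadrupling compute doubles both the model and the dataset, rather than putting most of the extra budget into parameters.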
Code Generation · Meta AI
Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.
Self-Supervised Learning · Meta AI
DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone — a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.
Multimodal Models · Google DeepMind
Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.
Language Models · OpenAI
GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.
Text-to-Image · Google Research
Google's Imagen set a new state-of-the-art zero-shot FID of 7.27 on COCO without ever training on COCO, and showed that scaling a frozen T5-XXL text encoder lifts fidelity and alignment more than scaling the diffusion model.
Alignment · OpenAI
OpenAI's InstructGPT used human feedback to align GPT-3, and evaluators preferred its 1.3B model over the 175B GPT-3 — more helpful with 100x fewer parameters.
Language Models · Google Research
PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It set state-of-the-art few-shot results on hundreds of language benchmarks and beat average human performance on BIG-bench.
Segmentation · Meta AI
Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.
Language Models · Google Research
T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.
Vision Foundation Models · Google Research
ViT shows a plain Transformer fed raw 16x16 image patches beats top CNNs once pre-trained on JFT-300M, reaching 88.55% on ImageNet while using far less training compute.
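To make the patch input concrete, here is a minimal pure-Python sketch of cutting an image into flattened 16x16 patches. The 224x224 RGB resolution and the name `patchify` are assumptions for illustration. The real model also applies a learned linear projection and position embeddings to these vectors, which this sketch omits:

```python
def patchify(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):          # walk the patch grid row by row
        for left in range(0, width, patch):
            vec = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    vec.extend(pixel)            # append one pixel's channel values
            patches.append(vec)
    return patches

# A blank 224 x 224 RGB image (assumed resolution for this example).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

Each of the 196 vectors plays the role a word token plays in a text Transformer.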
Speech Recognition · OpenAI
OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. It matches fully supervised systems zero-shot — no fine-tuning — and adds translation and language ID.
Biomolecular Modeling · Google DeepMind
AlphaFold 3 replaces AlphaFold 2's structure module with a diffusion network and predicts whole complexes — proteins with nucleic acids, ligands, ions, and modified residues — in one model.
Theorem Proving · Google DeepMind
AlphaGeometry pairs a language model with a symbolic engine and trains on 100M synthetic theorems, solving 25 of 30 olympiad geometry problems versus 10 for the prior best.
Transformers · Google Research
The 2017 Transformer dropped recurrence and convolution for pure attention, hit 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR, and trained in 3.5 days on 8 GPUs. Nearly every modern LLM inherits it.
Multimodal Models · OpenAI
CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot — without using any of its 1.28M labeled examples.
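Zero-shot classification with paired encoders comes down to comparing one image embedding against the text embeddings of candidate captions. The sketch below uses made-up 3-dimensional vectors in place of real encoder outputs. The prompts, the vectors, the temperature value, and the function names are all hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (best_label, probabilities) via a softmax over scaled similarities."""
    labels = list(text_embs)
    logits = [cosine(image_emb, text_embs[label]) / temperature for label in labels]
    top = max(logits)                          # subtract the max for stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy stand-ins for text-encoder outputs, one per candidate caption.
text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
image_emb = [0.8, 0.2, 0.1]                    # toy stand-in for an image embedding
label, probs = zero_shot_classify(image_emb, text_embs)
```

No labeled training examples enter this step: the class names themselves, written as captions, act as the classifier.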
Text-to-Image · OpenAI
DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.
Alignment · Stanford University
Direct Preference Optimization solves the RLHF problem with a single classification-style loss on preference pairs — no separate reward model, no RL loop, no sampling during training.
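A minimal sketch of that loss for a single preference pair, assuming the four inputs are summed sequence log-probabilities from the policy being trained and from a frozen reference model. The argument names and the `beta=0.1` default are illustrative choices:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy now favors the chosen answer
```

When the policy equals the reference the loss is log 2, and it falls as the policy shifts probability toward the preferred response relative to the reference.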
Efficient AI · Stanford University
FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut reads and writes between GPU high-bandwidth memory and on-chip SRAM, delivering a 3x training speedup on GPT-2, a 15% end-to-end speedup on BERT-large, and memory that is linear in sequence length.
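The tiling idea rests on an online softmax: keys and values are processed one block at a time while a running maximum and running sum are rescaled, so the full score row is never stored. The following is a single-query, pure-Python sketch of that idea, not the GPU kernel. The recomputation used in the backward pass is not shown, and the function names and block size are assumptions:

```python
import math

def _score(q, k):
    """Scaled dot-product score between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def naive_attention(q, keys, values):
    """Reference: materialize every score, then apply softmax and average."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """Same result, visiting keys and values one block at a time."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [_score(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.2, 0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values)
```

The two outputs agree up to floating-point rounding, which is the sense in which the blockwise method is exact rather than approximate.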
Long Context · Google DeepMind
Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio — and Pro matches Gemini 1.0 Ultra with far less compute.
Multimodal Models · OpenAI
OpenAI's GPT-4 report is a measurement document, not a recipe. It hits human-level scores on professional and academic exams — bar exam ~top 10% — yet discloses no architecture, data, or compute.
Open Models · Meta AI
Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and released with open weights.
Sequence Modeling · Carnegie Mellon University
Mamba makes state space model parameters depend on the input, so it selectively remembers or forgets tokens. It scales linearly in sequence length, delivers 5x higher inference throughput than Transformers, and Mamba-3B matches Transformers twice its size.
Diffusion Models · CompVis
Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.
Vision-Language-Action · Physical Intelligence
π0 bolts a flow-matching action expert onto a pretrained VLM, emitting ~50Hz action chunks so one policy can fold laundry, bus tables, and assemble boxes across single-arm, dual-arm, and mobile robots.
Vision-Language-Action · Google DeepMind
RT-2 co-fine-tunes a web-pretrained vision-language model on robot trajectories, expresses actions as text tokens, and gets emergent generalization to novel objects, unseen commands, and basic reasoning across 6k trials.
Segmentation · Meta AI
SAM 2 carries one click through a whole video using a streaming memory module, achieving better accuracy with 3x fewer interactions than prior video methods and running 6x faster than SAM on images.
LLM Reasoning · DeepSeek
DeepSeek-R1 learns to reason from reinforcement learning on whether its answer is correct — with no human reasoning examples — matches OpenAI o1 on AIME and MATH-500, and ships open MIT-licensed weights.