Research Papers — AI research papers, explained.

Latest

AlphaCode Explained: Competition-Level Code Generation

DeepMind's AlphaCode ranked within the top 54.3% on average in Codeforces contests with more than 5,000 participants. It generates up to a million candidate programs per problem, then filters and clusters them down to ten submissions.
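The filter-then-cluster step can be sketched roughly as follows. This is a toy illustration, not DeepMind's pipeline: candidate "programs" are plain Python callables, and the names `select_submissions`, `safe_run`, `example_tests`, and `probe_inputs` are hypothetical. It assumes the probe inputs are already available.

```python
from collections import defaultdict


def safe_run(prog, x):
    """Run a candidate on one input; treat any crash as a distinct 'None' output."""
    try:
        return prog(x)
    except Exception:
        return None


def select_submissions(candidates, example_tests, probe_inputs, k=10):
    """Filter on example tests, cluster by behavior, return up to k picks."""
    # Step 1: keep only candidates that pass every example test.
    survivors = [
        prog for prog in candidates
        if all(safe_run(prog, x) == expected for x, expected in example_tests)
    ]
    # Step 2: group survivors that produce identical outputs on the probe inputs.
    clusters = defaultdict(list)
    for prog in survivors:
        signature = tuple(repr(safe_run(prog, x)) for x in probe_inputs)
        clusters[signature].append(prog)
    # Step 3: one representative per cluster, largest clusters first.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:k]]


# Toy problem: "double the input". All but the last candidate pass the lone example.
candidates = [
    lambda x: x * 2,
    lambda x: x + x,
    lambda x: x ** 2,
    lambda x: x + 2,
    lambda x: x - 1,
]
picks = select_submissions(candidates, [(2, 4)], probe_inputs=[3, 5], k=2)
```

In this toy case four candidates pass the single example test, but only two agree with each other on the probe inputs, so their cluster is ranked first.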

Language Models · Google Research

BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra output layer, pushing the GLUE score to 80.5% and setting state of the art on 11 NLP tasks.

Language Models · Google DeepMind

Chinchilla: Why Compute-Optimal LLMs Beat Bigger Ones

DeepMind's Chinchilla shows that model size and training tokens should be scaled in equal proportion as the compute budget grows. Its 70B model, trained on ~1.4T tokens, beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.
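As a rough worked example of what "scale equally" implies, the sketch below takes the 70B and ~1.4T figures above, derives the tokens-per-parameter ratio, and splits a compute budget accordingly. The approximation C ≈ 6·N·D for training FLOPs is an assumption brought in for illustration, and `compute_optimal_split` is a hypothetical helper, not code from the paper.

```python
import math


def compute_optimal_split(flops, tokens_per_param):
    """Split a FLOP budget into (params, tokens), assuming C = 6*N*D and D = r*N."""
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Figures from the summary: 70B parameters trained on ~1.4T tokens.
chinchilla_params = 70e9
chinchilla_tokens = 1.4e12
ratio = chinchilla_tokens / chinchilla_params  # roughly 20 tokens per parameter

# Assumed FLOP estimate for that run under C = 6*N*D.
budget = 6.0 * chinchilla_params * chinchilla_tokens

params, tokens = compute_optimal_split(budget, ratio)
# With a fixed ratio, 4x the compute means 2x the parameters and 2x the tokens.
params_4x, tokens_4x = compute_optimal_split(4.0 * budget, ratio)
```

Under these assumptions, parameters and tokens each grow with the square root of compute, which is one way to read "scaled in equal proportion".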

Code Generation · Meta AI

Code Llama: Open Code Models Built on Llama 2 (7B-70B)

Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.

Self-Supervised Learning · Meta AI

DINOv2: Self-Supervised Visual Features That Skip Finetuning

DINOv2 pretrains Vision Transformers with no labels on a curated 142M-image set, then freezes the backbone — a linear probe on top matches or beats OpenCLIP on most image- and pixel-level benchmarks.

Multimodal Models · Google DeepMind

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.

Language Models · OpenAI

GPT-3 Explained: When the Prompt Became the Programming Interface

GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

Google's Imagen set a new state-of-the-art zero-shot FID of 7.27 on COCO without ever training on COCO, and showed that scaling a frozen T5-XXL text encoder lifts fidelity and alignment more than scaling the diffusion model.

Alignment · OpenAI

InstructGPT: How RLHF Beat a Model 100x Its Size

OpenAI's InstructGPT used human feedback to align GPT-3, and evaluators preferred its 1.3B model over the 175B GPT-3 — more helpful with 100x fewer parameters.

Language Models · Google Research

PaLM: Scaling a 540B Dense Language Model with Pathways

PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It achieved breakthrough few-shot results and beat the average human score on BIG-bench.

Segmentation · Meta AI

Segment Anything (SAM): One Promptable Model, a Billion Masks

Meta AI's SAM treats segmentation as a promptable task and ships with SA-1B (1.1B masks on 11M images), letting one model transfer zero-shot to new objects and image distributions.

Language Models · Google Research

T5 Explained: One Text-to-Text Interface for Every NLP Task

T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.

Vision Foundation Models · Google Research

Vision Transformer (ViT): An Image is Worth 16x16 Words

ViT shows that a plain Transformer fed raw 16x16 image patches beats top CNNs once pre-trained on JFT-300M, reaching 88.55% top-1 accuracy on ImageNet while using far less training compute.
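To make the "16x16 patches" idea concrete, here is a minimal pure-Python sketch that cuts an image into non-overlapping patches and flattens each one into a vector. The 224x224 RGB input size is an assumption for illustration, and `patchify` is a hypothetical helper; the learned linear projection and position embeddings that follow in a real model are omitted.

```python
def patchify(image, patch=16):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch row by row, pixel by pixel, channel by channel.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


# Assumed input: a blank 224 x 224 RGB image.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))
```

With these assumed dimensions the image becomes a sequence of 196 patch vectors of 768 values each, which is what the Transformer then treats as its "words".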

Speech Recognition · OpenAI

Whisper: 680,000 Hours of Weak Supervision for Robust ASR

OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. It matches fully supervised systems zero-shot — no fine-tuning — and adds translation and language ID.

Biomolecular Modeling · Google DeepMind

AlphaFold 3 Explained: Predicting Biomolecular Complexes With Diffusion

AlphaFold 3 replaces AlphaFold 2's structure module with a diffusion network and predicts whole complexes — proteins with nucleic acids, ligands, ions, and modified residues — in one model.

Theorem Proving · Google DeepMind

AlphaGeometry: Olympiad Geometry Without Human Proofs

AlphaGeometry pairs a language model with a symbolic engine and trains on 100M synthetic theorems, solving 25 of 30 olympiad geometry problems versus 10 for the prior best.

Transformers · Google Research

Attention Is All You Need: The Transformer Architecture Explained

The 2017 Transformer dropped recurrence and convolution for pure attention, hit 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR, and trained in 3.5 days on 8 GPUs. Nearly every modern LLM inherits it.

Multimodal Models · OpenAI

CLIP: Learning Visual Models From Natural Language Supervision

CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot — without using any of its 1.28M labeled examples.

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents

DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.

Alignment · Stanford University

DPO Explained: Aligning LLMs Without the RLHF Reward Model

Direct Preference Optimization solves the RLHF problem with a single classification-style loss on preference pairs — no separate reward model, no RL loop, no sampling during training.
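For a single preference pair, the classification-style loss can be sketched as below. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; the function name, argument names, and the default `beta=0.1` are illustrative assumptions rather than values taken from the summary.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)) equals softplus(-logit); this form avoids overflow.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Policy identical to the reference: the loss is log(2), like a coin flip.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss only needs log-probabilities from two forward passes per response, which is why no reward model or sampling loop appears anywhere in the sketch.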

Efficient AI · Stanford University

FlashAttention Explained: IO-Aware Exact Attention, 2-4x Faster

FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut GPU memory traffic, delivering a 3x training speedup on GPT-2, a 15% speedup on BERT-large, and memory that grows linearly with sequence length.
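The tiling idea rests on computing softmax in blocks while keeping a running maximum and running sum, so the full row of scores never has to be materialized at once. The sketch below shows that bookkeeping for one query row in plain Python, with scalar values for brevity. It is a simplified illustration under those assumptions, not the CUDA kernel, and it leaves out the recomputation used in the backward pass.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention_row(scores, values, block=4):
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * rescale + sum(weights)
        accumulator = accumulator * rescale + sum(w * v for w, v in zip(weights, v_blk))
        running_max = new_max
    return accumulator / running_sum


scores = [0.3, -1.2, 4.0, 2.5, 0.0, 7.1, -3.3, 1.8, 5.5, 0.9]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
exact = naive_attention_row(scores, values)
tiled = tiled_attention_row(scores, values, block=4)
```

Both functions return the same value up to floating-point rounding, which is the sense in which the blocked computation is exact rather than approximate.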

Long Context · Google DeepMind

Gemini 1.5: Near-Perfect Recall Across Millions of Tokens

Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio — and Pro matches Gemini 1.0 Ultra with far less compute.

Multimodal Models · OpenAI

GPT-4 Technical Report Explained: Benchmarks, Not Blueprints

OpenAI's GPT-4 report is a measurement document, not a recipe. It reports human-level scores on professional and academic exams, including a simulated bar exam score around the top 10% of test takers, yet discloses no architecture, data, or compute details.

Open Models · Meta AI

Llama 3: A 405B Dense Open Model That Matches GPT-4

Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and openly published with weights.

Sequence Modeling · Carnegie Mellon University

Mamba: Selective State Spaces for Linear-Time Sequence Modeling

Mamba makes state space model parameters depend on the input, so it selectively remembers or forgets tokens. It scales linearly in sequence length, delivers 5x higher inference throughput than Transformers, and Mamba-3B matches Transformers twice its size.

Diffusion Models · CompVis

Latent Diffusion Models: The Architecture Behind Stable Diffusion

Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.

Vision-Language-Action · Physical Intelligence

π0 Explained: A Vision-Language-Action Flow Model for Robots

π0 bolts a flow-matching action expert onto a pretrained VLM, emitting ~50Hz action chunks so one policy can fold laundry, bus tables, and assemble boxes across single-arm, dual-arm, and mobile robots.

Vision-Language-Action · Google DeepMind

RT-2 Explained: Vision-Language-Action Models for Robot Control

RT-2 co-fine-tunes a web-pretrained vision-language model on robot trajectories, expresses actions as text tokens, and gets emergent generalization to novel objects, unseen commands, and basic reasoning across 6k trials.

Segmentation · Meta AI

SAM 2 Explained: Promptable Segmentation Across Video

SAM 2 carries one click through a whole video using a streaming memory module, producing more accurate masks with 3x fewer interactions than prior video methods and running 6x faster than SAM on images.

LLM Reasoning · DeepSeek

DeepSeek-R1: How Pure Reinforcement Learning Taught an LLM to Reason

DeepSeek-R1 learns to reason through reinforcement learning that rewards correct answers, with no human reasoning examples. It matches OpenAI o1 on AIME and MATH-500 and ships open, MIT-licensed weights.