Language Models

BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.

Language Models · Google DeepMind

Chinchilla: Why Compute-Optimal LLMs Beat Bigger Ones

DeepMind's Chinchilla shows model size and training tokens should scale equally. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.

Language Models · OpenAI

GPT-3 Explained: When the Prompt Became the Programming Interface

GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.

Foundational papers

Text Embeddings · Independent Researcher

BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.

Sentence-BERT: Sentence Embeddings with Siamese BERT

Sentence-BERT turns sentence embeddings for semantic similarity into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

DistilBERT: A Smaller and Faster Version of BERT

DistilBERT turns knowledge distillation for compact language models into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

T5 Explained: One Text-to-Text Interface for Every NLP Task

T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.

Recent papers

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Diffusion Language Modeling: Promises and Challenges

This survey turns the state of diffusion language modeling into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

DistilBERT: A Smaller and Faster Version of BERT

DistilBERT turns knowledge distillation for compact language models into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text Embeddings · Microsoft Research

E5: Weakly-Supervised Contrastive Text Embeddings

E5 turns general-purpose text embeddings into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Language Models · Independent Researcher

Factorization-Error-Free Decoding for Diffusion LMs

Factorization-error-free decoding turns speculative decoding for discrete diffusion LMs into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

WASH: Averaging 3 LLMs Erases Text Watermarks

Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from 5-300 down below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.

Small Language Models · Google Research

MobileBERT: Compact BERT for Resource-Limited Devices

MobileBERT turns mobile-friendly BERT compression into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Biomolecular Modeling · Independent Researcher

ProGen2: Protein Language Models for Protein Design

ProGen2 turns protein sequence modeling and design into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text Embeddings · Independent Researcher

SEDD: Discrete Diffusion Language Modeling by Ratios

SEDD turns discrete diffusion language modeling into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Sentence-BERT: Sentence Embeddings with Siamese BERT

Sentence-BERT turns sentence embeddings for semantic similarity into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text Embeddings · Princeton University

SimCSE: Contrastive Learning for Sentence Embeddings

SimCSE turns contrastive sentence embedding learning into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Small Language Models · Independent Researcher

TinyLlama: An Open Small Language Model Recipe

TinyLlama turns open small language model training into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Diffusion Language Models · Stanford University

Diffusion-LM: Controllable Text from Denoising

Diffusion-LM uses continuous denoising over word vectors so gradient guidance can control syntax and other fine-grained attributes without retraining the LM.

Mixture of Experts · National University of Singapore

dMoE: Block-Level Expert Routing for Diffusion LLMs

dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.

Efficient AI · Shanghai AI Laboratory

Draft-OPD: On-Policy Distillation Pushes Speculative Decoding Past 5x

Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x, beating EAGLE-3 by 23 percent and DFlash by 13 percent.

Biomolecular Modeling · EvolutionaryScale

ESM3: Protein Generation as Evolutionary Simulation

ESM3 is a multimodal protein language model over sequence, structure, and function; it generated a fluorescent protein only 58% identical to known fluorescent proteins.

Interpretability · Google DeepMind

Gemma Scope: DeepMind's Open SAE Suite for Interpreting Gemma 2

Gemma Scope is a free, open suite of JumpReLU sparse autoencoders covering every layer of Gemma 2 2B and 9B (plus parts of 27B) — over 400 SAEs and 30M+ features, costing more than 20% of GPT-3's compute to train.

Diffusion Language Models · Renmin University of China

Language Models Need Sleep: A Consolidate-and-Dream Recipe

Google Research argues LLMs need an offline sleep phase to turn short-term context into stable weights. With sleep, Qwen3-8B hits 79.2% on AIME-24 and a Transformer reaches 80% on ARC few-shot, beating SEAL.

LLaDA: An 8B Diffusion Language Model That Rivals LLaMA3

LLaDA trains an 8B language model by masked diffusion instead of next-token prediction, matches LLaMA3 8B in in-context learning, hits 70.7 on GSM8K, and beats GPT-4o on the reversal-curse poem task.

Small Language Models · Meta AI

MobileLLM: Better Sub-Billion Models for Devices

MobileLLM argues architecture matters more at sub-billion scale: deep-thin designs plus embedding sharing and grouped-query attention improve 125M/350M models by 2.7%/4.3%, and block-wise weight sharing adds 0.7%/0.8% more.

Language Models · Xiaohongshu

NITP: Predict the Next Token's Meaning, Not Just Its ID

NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.

Interpretability · Northeastern University

Position-Aware Circuit Discovery for Language Models

This work fixes a blind spot in automatic circuit discovery: model components can matter at specific token positions, so position-invariant circuits miss real mechanisms.

Retrieval-Augmented Generation · Universidad de San Andres

SmolLM2: A Fully Open 1.7B Model Built on a Public Data Recipe

SmolLM2 is a 1.7B model overtrained on ~11T tokens through four data stages. It scores 68.7 on HellaSwag and 19.4 on MMLU-Pro, beating Llama3.2-1B — and ships every dataset, not just the weights.

Interpretability · EleutherAI

Sparse Autoencoders Find Interpretable Features in LLMs

Training a sparse autoencoder on a language model's activations pulls apart 'superposition' into single-meaning features more interpretable than neurons — and lets you edit one concept and watch behavior change.

Speech Synthesis · Microsoft Research

VALL-E: Zero-Shot Voice Cloning with Audio Tokens

VALL-E reframes TTS as codec-token language modeling: 60K hours of speech plus a 3-second prompt produce personalized zero-shot speech, but safety and release constraints matter.

Active Learners as Efficient PRP Rerankers: Fewer LLM Calls

Treating pairwise LLM reranking as active learning, a tournament selector hits 68.00 NDCG@10 on TREC DL while cutting LLM calls 3-5x versus sorting-based PRP, plus a randomized-direction oracle that debiases in one call.

Code Generation · Google DeepMind

AlphaCode Explained: Competition-Level Code Generation

DeepMind's AlphaCode averaged a top 54.3% ranking on Codeforces contests with 5,000+ participants by generating up to a million candidate programs per problem, then filtering and clustering them down to ten submissions.

AI Agents · Shanghai AI Laboratory

Pi-Bench: Can AI Assistants Anticipate What You Did Not Say?

Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.

LLM Reasoning · Renmin University of China

BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.

Language Models · Google DeepMind

Chinchilla: Why Compute-Optimal LLMs Beat Bigger Ones

DeepMind's Chinchilla shows model size and training tokens should scale equally. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.

LLM Reasoning · Google Research

Chain-of-Thought Prompting: How Showing the Steps Unlocks LLM Reasoning

Showing a few worked examples with intermediate reasoning steps lets big models solve multi-step problems — a 540B model with 8 chain-of-thought exemplars hits 57% on GSM8K, beating fine-tuned GPT-3 with a verifier.

AI Agents · Shanghai AI Laboratory

COLLEAGUE.SKILL: Turning One Person's Expertise Into a Portable AI Skill

COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.

Code Generation · Meta AI

Code Llama: Open Code Models Built on Llama 2 (7B-70B)

Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.

Alignment · Anthropic

Constitutional AI: Training a Harmless Assistant from AI Feedback

Constitutional AI trains a harmless assistant with almost no human harm labels — a model critiques and revises its own answers against a written list of principles, then learns from AI-generated preferences (RLAIF).

DelTA: Discriminative Token Credit Assignment for RLVR Reasoning

DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.

Efficient AI · Shanghai Jiao Tong University

Domino: Splitting the Draft and the Causal Fix in Speculative Decoding

Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.

LLM Reasoning · Alibaba Qwen Team

DVAO: Variance-Adaptive Advantage Weighting for Multi-Reward RL

DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.

Long Context · University of Illinois Urbana-Champaign

From Context to Skills: Ctx2Skill Self-Evolves Context Learning

Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.

Open Models · Google DeepMind

Gemma Explained: Google DeepMind's Open Models from Gemini Tech

Gemma is a 2B and 7B family of open-weight models distilled from Gemini research that beats similarly sized open models on 11 of 18 text tasks, shipped with pretrained and instruction-tuned checkpoints.

Language Models · OpenAI

GPT-3 Explained: When the Prompt Became the Programming Interface

GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.

Retrieval-Augmented Generation · University of Massachusetts Amherst

GrepSeek: Search Agents That grep the Corpus Instead of a Vector Index

GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.

AI Agents · University of Illinois Urbana-Champaign

Eywa: Letting LLM Agents Call Scientific Foundation Models

Eywa lets an LLM agent invoke domain models like Chronos and TabPFN through a learned interface instead of serializing data into text. On EywaBench it lifts utility from 0.6154 to 0.6558 while cutting ~30% tokens.

Efficient AI · Sapient Intelligence

HRM-Text: A 1B Model Trained From Scratch for $1,500

HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.

Alignment · OpenAI

InstructGPT: How RLHF Beat a Model 100x Its Size

OpenAI's InstructGPT used human feedback to align GPT-3, and evaluators preferred its 1.3B model over the 175B GPT-3 — more helpful with 100x fewer parameters.

Open Models · Meta AI

Llama 2 Explained: Meta's Open Weights and the RLHF Chat Recipe

Llama 2 shipped 7B, 13B, and 70B open-weight models plus Llama 2-Chat, the first open chat model whose RLHF pipeline — including a separate safety reward model and Ghost Attention — was documented in full.

Multimodal Models · Microsoft Research

LLaVA Explained: Visual Instruction Tuning for a Vision-Language Chat Model

LLaVA bolts a CLIP vision encoder onto a Vicuna LLM with one linear projection, then trains on GPT-4-generated image instructions — hitting 85.1% of GPT-4's score and 92.53% on ScienceQA.

Long Context · Shanghai AI Laboratory

δ-mem: An 8×8 Online Memory That Boosts Frozen LLMs

δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods — no fine-tuning, no context extension.

AI Agents · MemTensor

MemPrivacy: Private Edge-Cloud Agent Memory via Reversible Placeholders

MemPrivacy swaps sensitive spans for type-aware placeholders on-device, processes memory in the cloud over them, then restores them locally — utility loss stays within 1.6% and 0.6B-4B models beat GPT-5.2 at detection.

Retrieval-Augmented Generation · Meta AI

PaLM: Scaling a 540B Dense Language Model with Pathways

PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It hit breakthrough few-shot results and beat average human scores on BIG-bench.

Efficient AI · Microsoft Research

Phi-3-mini: A 3.8B Model That Rivals GPT-3.5 on Your Phone

Phi-3-mini is a 3.8B-parameter model trained on 3.3T heavily filtered and synthetic tokens that hits 69% on MMLU and 8.38 on MT-bench — matching Mixtral 8x7B and GPT-3.5 while small enough to run on a phone.

Alignment · Seoul National University

Why Personality Tests Mischaracterize LLM Behavior

Giving an LLM the Big Five or a values survey predicts almost nothing about how it acts in real queries: cross-method agreement was only Spearman 0.31 (values) and 0.26 (personality), versus 0.74-0.77 within-survey.

Open Models · Alibaba Qwen Team

Qwen2.5 Explained: Alibaba's Open LLM Family, 0.5B to 72B

Qwen2.5 is Alibaba's open-weight LLM family spanning 0.5B–72B, pretrained on 18T tokens; the 72B-Instruct flagship rivals Llama-3-405B-Instruct, a model roughly 5x larger.

LLM Reasoning · Princeton University

ReAct: How Interleaving Reasoning and Acting Built the LLM Agent

ReAct interleaves a model's reasoning traces with task actions like search and API calls, cutting chain-of-thought hallucination and beating RL agents on ALFWorld by 34% absolute with one or two examples.

RAG (2020): The Paper That Named Retrieval-Augmented Generation

The original RAG paper bolts a Wikipedia dense retriever (DPR) onto a BART seq2seq generator, sets a new state of the art on three open-domain QA tasks, and updates knowledge by swapping the index, with no retraining.

AI Agents · Microsoft Research

SkillOpt: Training a Frozen Agent's Skill Text Like a Model

SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.