AI Agents · Independent Researcher
ArcANE: Measuring When Role-Playing Agents Break Character turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Diffusion Language Models · Independent Researcher
This survey of diffusion language modeling turns the state of the field into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Small Language Models · Hugging Face
DistilBERT applies knowledge distillation during BERT pretraining to produce a model 40% smaller and 60% faster that retains 97% of BERT's language-understanding performance.
Text Embeddings · Microsoft Research
E5 turns general-purpose text embeddings into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Diffusion Language Models · Independent Researcher
Factorization-error-free decoding turns speculative decoding for discrete diffusion LMs into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Language Models · Independent Researcher
Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from 5-300 to below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.
Small Language Models · Google Research
MobileBERT compresses BERT into a task-agnostic model 4.3x smaller and 5.5x faster than BERT-Base, scoring 77.7 on GLUE, 0.6 below BERT-Base.
Biomolecular Modeling · Salesforce Research
ProGen2 turns protein sequence modeling and design into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Diffusion Language Models · Stanford University
SEDD turns discrete diffusion language modeling into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Text Embeddings · TU Darmstadt
Sentence-BERT fine-tunes BERT in a siamese setup to produce sentence embeddings comparable by cosine similarity, cutting the search for the most similar pair among 10,000 sentences from about 65 hours with BERT to about 5 seconds.
Text Embeddings · Princeton University
SimCSE turns contrastive sentence embedding learning into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Small Language Models · Singapore University of Technology and Design
TinyLlama turns open small language model training into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Diffusion Language Models · Stanford University
Diffusion-LM uses continuous denoising over word vectors so gradient guidance can control syntax and other fine-grained attributes without retraining the LM.
Mixture of Experts · National University of Singapore
dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.
Efficient AI · Shanghai AI Laboratory
Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x, beating EAGLE-3 by 23 percent and DFlash by 13 percent.
Biomolecular Modeling · EvolutionaryScale
ESM3 is a multimodal protein language model over sequence, structure, and function; it generated a fluorescent protein only 58% identical to known fluorescent proteins.
Interpretability · Google DeepMind
Gemma Scope is a free, open suite of JumpReLU sparse autoencoders covering every layer of Gemma 2 2B and 9B (plus parts of 27B) — over 400 SAEs and 30M+ features, costing more than 20% of GPT-3's compute to train.
Language Models · Google Research
Google Research argues LLMs need an offline sleep phase to turn short-term context into stable weights. With sleep, Qwen3-8B hits 79.2% on AIME-24 and a Transformer reaches 80% on ARC few-shot, beating SEAL.
Diffusion Language Models · Renmin University of China
LLaDA trains an 8B language model by masked diffusion instead of next-token prediction, matches LLaMA3 8B in in-context learning, hits 70.7 on GSM8K, and beats GPT-4o on the reversal-curse poem task.
Small Language Models · Meta AI
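MobileLLM argues architecture matters more at sub-billion scale: deep-and-thin designs with embedding sharing and grouped-query attention improve 125M/350M models by 2.7%/4.3% over the prior state of the art, and block-wise weight sharing adds another 0.7%/0.8%.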
Language Models · Xiaohongshu
NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.
Interpretability · Northeastern University
This work fixes a blind spot in automatic circuit discovery: model components can matter at specific token positions, so position-invariant circuits miss real mechanisms.
Small Language Models · Hugging Face
SmolLM2 is a 1.7B model overtrained on ~11T tokens through four data stages. It scores 68.7 on HellaSwag and 19.4 on MMLU-Pro, beating Llama3.2-1B — and ships every dataset, not just the weights.
Interpretability · EleutherAI
Training a sparse autoencoder on a language model's activations pulls apart 'superposition' into single-meaning features more interpretable than neurons — and lets you edit one concept and watch behavior change.
Speech Synthesis · Microsoft Research
VALL-E reframes TTS as codec-token language modeling: 60K hours of speech plus a 3-second prompt produce personalized zero-shot speech, but safety and release constraints matter.
Retrieval-Augmented Generation · Universidad de San Andres
By treating pairwise LLM reranking as active learning, a tournament selector hits 68.00 NDCG@10 on TREC DL while cutting LLM calls 3-5x versus sorting-based PRP. A randomized-direction oracle also debiases each comparison in one call.
Code Generation · Google DeepMind
DeepMind's AlphaCode averaged a top 54.3% ranking on Codeforces contests with 5,000+ participants by generating up to a million candidate programs per problem, then filtering and clustering them down to ten submissions.
AI Agents · Shanghai AI Laboratory
Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.
Language Models · Google Research
BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.
Language Models · Google DeepMind
DeepMind's Chinchilla shows model size and training tokens should scale equally. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.
LLM Reasoning · Google Research
Showing a few worked examples with intermediate reasoning steps lets big models solve multi-step problems — a 540B model with 8 chain-of-thought exemplars hits 57% on GSM8K, beating fine-tuned GPT-3 with a verifier.
AI Agents · Shanghai AI Laboratory
COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.
Code Generation · Meta AI
Code Llama continues training Llama 2 on code, reaching up to 67% on HumanEval and 65% on MBPP, the best open scores at its release, with infilling, instruction following, and 100k-token context support.
Alignment · Anthropic
Constitutional AI trains a harmless assistant with almost no human harm labels — a model critiques and revises its own answers against a written list of principles, then learns from AI-generated preferences (RLAIF).
LLM Reasoning · Renmin University of China
DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.
Efficient AI · Shanghai Jiao Tong University
Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.
LLM Reasoning · Alibaba Qwen Team
DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.
Long Context · University of Illinois Urbana-Champaign
Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.
Open Models · Google DeepMind
Gemma is a family of 2B and 7B open-weight models built from the research and technology behind Gemini. It beats similarly sized open models on 11 of 18 text tasks and ships with pretrained and instruction-tuned checkpoints.
Language Models · OpenAI
GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.
Retrieval-Augmented Generation · University of Massachusetts Amherst
GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.
AI Agents · University of Illinois Urbana-Champaign
Eywa lets an LLM agent invoke domain models like Chronos and TabPFN through a learned interface instead of serializing data into text. On EywaBench it lifts utility from 0.6154 to 0.6558 while cutting token use by ~30%.
Efficient AI · Sapient Intelligence
HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.
Alignment · OpenAI
OpenAI's InstructGPT used human feedback to align GPT-3, and evaluators preferred its 1.3B model over the 175B GPT-3 — more helpful with 100x fewer parameters.
Open Models · Meta AI
Llama 2 shipped 7B, 13B, and 70B open-weight models plus Llama 2-Chat, the first open chat model whose RLHF pipeline — including a separate safety reward model and Ghost Attention — was documented in full.
Multimodal Models · Microsoft Research
LLaVA bolts a CLIP vision encoder onto a Vicuna LLM with one linear projection, then trains on GPT-4-generated image instructions, reaching an 85.1% relative score against GPT-4 on a synthetic multimodal benchmark and 92.53% on ScienceQA when combined with GPT-4.
Long Context · Shanghai AI Laboratory
δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods — no fine-tuning, no context extension.
AI Agents · MemTensor
MemPrivacy swaps sensitive spans for type-aware placeholders on-device, processes memory in the cloud over them, then restores them locally — utility loss stays within 1.6% and 0.6B-4B models beat GPT-5.2 at detection.
Language Models · Google Research
PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It hit breakthrough few-shot results and beat average human scores on BIG-bench.
Efficient AI · Microsoft Research
Phi-3-mini is a 3.8B-parameter model trained on 3.3T heavily filtered and synthetic tokens that hits 69% on MMLU and 8.38 on MT-bench — matching Mixtral 8x7B and GPT-3.5 while small enough to run on a phone.
Alignment · Seoul National University
Giving an LLM the Big Five or a values survey predicts almost nothing about how it acts in real queries: cross-method agreement was only Spearman 0.31 (values) and 0.26 (personality), versus 0.74-0.77 within-survey.
Open Models · Alibaba Qwen Team
Qwen2.5 is Alibaba's open-weight LLM family spanning 0.5B–72B, pretrained on 18T tokens; the 72B-Instruct flagship rivals Llama-3-405B-Instruct, a model roughly 5x larger.
LLM Reasoning · Princeton University
ReAct interleaves a model's reasoning traces with task actions like search and API calls, cutting chain-of-thought hallucination and beating RL agents on ALFWorld by 34% absolute with one or two examples.
Retrieval-Augmented Generation · Meta AI
The original RAG paper bolts a Wikipedia dense retriever (DPR) onto a BART seq2seq generator, sets a new state of the art on three open-domain QA tasks, and updates knowledge by swapping the index, with no retraining.
AI Agents · Microsoft Research
SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by 23.5 points in direct chat across six benchmarks.
Language Models · Google Research
T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.
LLM Reasoning · Meta AI
Toolformer trains a model to decide which API to call — calculator, QA, search, translation, calendar — purely by keeping the sampled calls that lower next-token loss, with only a handful of demos per tool.
Language Models · Alibaba Qwen Team
TransitLM is a 13M-record corpus from four Chinese cities (120,845 stations) that trains a language model to plan transit routes with no map engine — a 4B model hits 97.0% connectivity and 71.0% exact match.
Text Embeddings · Renmin University of China
EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.