Efficient AI

Algorithms and systems that reduce memory, compute, or latency for large models.

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.

Mixture of Experts · Renmin University of China

Manifold Power Iteration: A Better Router for MoE Models

MPI redesigns MoE routers by aligning router rows with expert weight directions. On an 11B MoE, average benchmark accuracy rises from 40.92 to 42.76 with only a 0.2% training slowdown.

Small Language Models · Hugging Face

DistilBERT: A Smaller and Faster Version of BERT

DistilBERT uses knowledge distillation during pre-training to build a model 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.

Biomolecular Modeling · Independent Researcher

DynamicMPNN: Multi-State Protein Design with Inverse Folding

DynamicMPNN applies inverse folding to multi-state protein sequence design; the write-up covers the supporting evidence, method tradeoffs, and limits for practical use.

Diffusion Language Models · Independent Researcher

Factorization-Error-Free Decoding for Diffusion LMs

Factorization-error-free decoding is a speculative decoding method for discrete diffusion LMs; the write-up covers the supporting evidence, method tradeoffs, and limits for practical use.

Diffusion Models · The Hong Kong Polytechnic University

GGT-100K: Generative Ground Truth for Image Restoration

GGT-100K supplies generative ground truth for real-world image restoration data, giving restoration models a checkable test; the write-up covers failure signals, benchmark limits, and takeaways for builders.

Small Language Models · Google Research

MobileBERT: Compact BERT for Resource-Limited Devices

MobileBERT compresses BERT for mobile devices: a task-agnostic model 4.3x smaller and 5.5x faster than BERT-base, built with bottleneck layers and knowledge transfer from a specially designed teacher.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.

Fine-Tuning & Adaptation · The Hong Kong Polytechnic University

Token Teachability: Distilling LLMs on Just 5% of Tokens

Teachability-Aware OPD supervises only ~5% of tokens, those where the teacher's correction lands inside the student's top-K support, matching or beating full-token distillation (44.89 vs 42.37 on Qwen3-4B to 1.7B).

Efficient AI · Shanghai AI Laboratory

Draft-OPD: On-Policy Distillation Pushes Speculative Decoding Past 5x

Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x, beating EAGLE-3 by 23 percent and DFlash by 13 percent.

Small Language Models · Meta AI

MobileLLM: Better Sub-Billion Models for Devices

MobileLLM argues architecture matters more at sub-billion scale: deep-thin designs with embedding sharing and grouped-query attention lift 125M/350M accuracy by 2.7%/4.3% over prior models, and block-wise weight sharing adds 0.7%/0.8% more.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-Flash: Beyond Objective Design in Few-Step Distillation

Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.

Video Generation · Virginia Tech

VideoMLA: A Low-Rank Latent KV Cache for Minute-Scale Video Diffusion

VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory 92.7% (224 vs 3,072 scalars), winning VBench at 60s, and lifting B200 throughput 1.23x.

Retrieval-Augmented Generation · Universidad de San Andres

Active Learners as Efficient PRP Rerankers: Fewer LLM Calls

Treating pairwise LLM reranking as active learning, a tournament selector hits 68.00 NDCG@10 on TREC DL while cutting LLM calls 3-5x versus sorting-based PRP, plus a randomized-direction oracle that debiases in one call.

Diffusion Models · NVIDIA

AnyFlow: Any-Step Video Diffusion via Flow Map Distillation

AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop consistency-distilled models suffer at higher step counts. Tested on Wan2.1 from 1.3B to 14B.

Diffusion Models · Tsinghua University

Causal Forcing++: Few-Step Autoregressive Diffusion for Real-Time Video

Causal Forcing++ distills bidirectional video diffusion into a 1-2 step frame-wise autoregressive generator at 14.1 FPS, halves first-frame latency, and cuts few-step training cost ~4x (11,600 to 2,900 A800 GPU hours).

Code Generation · University of Waterloo

Code2LoRA: Hypernetworks That Generate Repo-Specific LoRA Adapters

Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost — 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.

Open Models · DeepSeek

DeepSeek-V3 Explained: A 671B MoE Trained for 2.788M GPU Hours

DeepSeek-V3 is an open-weight 671B-parameter MoE model that activates only 37B parameters per token and matches leading closed models on many benchmarks. It was pre-trained on 14.8T tokens, and its full training took just 2.788M H800 GPU hours.

Efficient AI · Shanghai Jiao Tong University

Domino: Splitting the Draft and the Causal Fix in Speculative Decoding

Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.

Long Context · Tencent

FlashMemory-DeepSeek-V4: Cutting KV Cache to 13.5% for 500K Context

FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.

Efficient AI · Alibaba Qwen Team

Full Attention Strikes Back: RTPurbo Sparsifies LLMs in Hundreds of Steps

RTPurbo converts a trained full-attention LLM into a sparse one with about 600+600 adaptation steps, keeping LongBench accuracy (54.24 vs 53.80) while hitting 9.36x prefill speedup at 1M context.

Efficient AI · Sapient Intelligence

HRM-Text: A 1B Model Trained From Scratch for $1,500

HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.

Efficient AI · Huawei

KVarN: 2-Bit KV-Cache Quantization Without Calibration

KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Text-to-Image · Microsoft Research

Lens: A 3.8B Text-to-Image Model Trained on ~19% of Z-Image's Compute

Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by feeding it longer, denser captions.

Diffusion Models · NVIDIA

LongLive-2.0: NVFP4 4-bit Training and Inference for Long Video

LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.

Multimodal Models · NVIDIA

LocateAnything: Parallel Box Decoding for Faster Vision Grounding

LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.

Efficient AI · Microsoft Research

LoRA Explained: Low-Rank Adaptation for Fine-Tuning LLMs

LoRA freezes a pretrained model and trains tiny low-rank matrices per layer instead — cutting trainable parameters up to 10,000x and GPU memory 3x versus full GPT-3 175B fine-tuning, with no extra latency.

Long Context · Shanghai AI Laboratory

δ-mem: An 8×8 Online Memory That Boosts Frozen LLMs

δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods — no fine-tuning, no context extension.

Fine-Tuning & Adaptation · Mind Lab

MinT: Infrastructure for Training and Serving Millions of LoRA LLMs

MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.

Open Models · Mistral AI

Mistral 7B: The 7B Open Model That Beat Llama 2 13B

Mistral 7B is a 7-billion-parameter open model that outperforms Llama 2 13B on every benchmark tested, uses grouped-query and sliding-window attention for cheap inference, and ships under Apache 2.0.

Retrieval-Augmented Generation · AIRI

OCC-RAG: Small Models Built Only to Read Context Faithfully

OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.

Fine-Tuning & Adaptation · Mind Lab

Scaling PEFT: Toward a Million Personal Models on One Base

A position paper that reframes LoRA adapters as persistent personal state rather than a cheap substitute for full fine-tuning. It argues along three axes (scale up the base, scale down the adapter, scale out to millions) and pairs them with a serving stack, MinT.

Efficient AI · Microsoft Research

Phi-3-mini: A 3.8B Model That Rivals GPT-3.5 on Your Phone

Phi-3-mini is a 3.8B-parameter model trained on 3.3T heavily filtered and synthetic tokens that hits 69% on MMLU and 8.38 on MT-bench — matching Mixtral 8x7B and GPT-3.5 while small enough to run on a phone.

Diffusion Models · University of Science and Technology of China

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 at 5s and cutting the drift that wrecks 30-second clips.

Mixture of Experts · Google Research

Switch Transformer: One Expert Per Token, Up to a Trillion Parameters

Switch Transformer simplifies Mixture-of-Experts by routing each token to a single expert, hitting up to 7x faster T5 pretraining at fixed compute and scaling to 1.6 trillion parameters with bfloat16 training.
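
Routing each token to a single expert can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy experts, and the hand-written router logits are hypothetical, and the capacity limits and load-balancing loss used in real training are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Pick the single highest-probability expert per token.

    Returns a list of (expert_index, gate_probability) pairs.
    """
    decisions = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        decisions.append((expert, probs[expert]))
    return decisions

def switch_layer(tokens, router_logits, experts):
    """Run each token through its chosen expert and scale by the gate."""
    outputs = []
    for token, (expert, gate) in zip(tokens, switch_route(router_logits)):
        outputs.append([gate * v for v in experts[expert](token)])
    return outputs

# Toy experts standing in for feed-forward blocks (hypothetical).
experts = [
    lambda t: [2.0 * v for v in t],
    lambda t: [-v for v in t],
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[2.0, 0.0], [0.0, 2.0]]  # would come from a learned router

decisions = switch_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```

Only one expert runs per token, so compute per token stays roughly constant as more experts (and parameters) are added.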

Fine-Tuning & Adaptation · T-Tech

Trust-Region Behavior Blending: A Warmup Fix for On-Policy Distillation

On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero, giving the best average on two math settings.

Efficient AI · Stanford University

FlashAttention Explained: IO-Aware Exact Attention, 2-4x Faster

FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut GPU memory traffic, delivering a 3x training speedup on GPT-2, a 15% end-to-end speedup on BERT-large, and memory that grows linearly with sequence length.
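
The tiling idea can be illustrated with a running softmax over key/value blocks. The sketch below is plain Python for a single query, so it demonstrates only that block-wise processing gives the same result as standard attention; it says nothing about GPU memory traffic and omits the recomputation used in the backward pass. The function names, block size, and toy inputs are assumptions for illustration.

```python
import math

def _scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_reference(q, keys, values):
    """Standard softmax attention that materialises every score at once."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, block=2):
    """Same result, computed one key/value block at a time."""
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * len(values[0])  # unnormalised weighted sum of values
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = _scores(q, k_blk)
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [4.0, -4.0]]

exact = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=2)
```

Because only the running maximum, running sum, and accumulator persist between blocks, the full score matrix never has to be stored, which is where the linear memory in sequence length comes from.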

Sequence Modeling · Carnegie Mellon University

Mamba: Selective State Spaces for Linear-Time Sequence Modeling

Mamba makes state space model parameters depend on the input, so it selectively remembers or forgets tokens. It scales linearly in sequence length, reaches 5x higher inference throughput than Transformers, and Mamba-3B matches Transformers twice its size.