MiniMax Sparse Attention: 1M Context Without Dense Attention
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
Topics
Algorithms and systems that reduce memory, compute, or latency for large models.
Mixture of Experts · Renmin University of China
MPI redesigns MoE routers by aligning router rows with expert weight directions. On an 11B MoE model, average benchmark accuracy rises from 40.92 to 42.76 with only a 0.2% training slowdown.
Small Language Models · Hugging Face
DistilBERT turns knowledge distillation for compact language models into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Biomolecular Modeling · Independent Researcher
DynamicMPNN turns multi-state protein sequence design into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Diffusion Language Models · Independent Researcher
Factorization-error-free decoding turns speculative decoding for discrete diffusion LMs into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Diffusion Models · The Hong Kong Polytechnic University
GGT-100K: Generative Ground Truth for Image Restoration turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Small Language Models · Google Research
MobileBERT turns mobile-friendly BERT compression into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.
LLM Reasoning · Shanghai AI Laboratory
ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.
Fine-Tuning & Adaptation · The Hong Kong Polytechnic University
Teachability-Aware OPD supervises only ~5% of tokens, those where the teacher's correction lands inside the student's top-K support, yet matches or beats full-token distillation (44.89 vs 42.37 when distilling Qwen3-4B into a 1.7B student).
Efficient AI · Shanghai AI Laboratory
Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x, beating EAGLE-3 by 23 percent and DFlash by 13 percent.
Small Language Models · Meta AI
MobileLLM argues that architecture matters more at sub-billion scale: deep-thin designs with embedding sharing and grouped-query attention improve 125M/350M models by 2.7%/4.3%, and block-wise weight sharing adds a further 0.7%/0.8%.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.
Video Generation · Virginia Tech
VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory 92.7% (224 vs 3,072 scalars), winning VBench at 60s, and lifting B200 throughput 1.23x.
Retrieval-Augmented Generation · Universidad de San Andres
Treating pairwise LLM reranking as active learning, a tournament selector hits 68.00 NDCG@10 on TREC DL while cutting LLM calls 3-5x versus sorting-based PRP, plus a randomized-direction oracle that debiases in one call.
AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop consistency-distilled models suffer at higher step counts. Tested on Wan2.1 from 1.3B to 14B.
Diffusion Models · Tsinghua University
Causal Forcing++ distills bidirectional video diffusion into a 1-2 step frame-wise autoregressive generator at 14.1 FPS, halves first-frame latency, and cuts few-step training cost ~4x (11,600 to 2,900 A800 GPU hours).
Code Generation · University of Waterloo
Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost — 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.
DeepSeek-V3 is a 671B-parameter MoE model that activates only 37B params per token, matches leading closed models on many benchmarks, and was pre-trained on 14.8T tokens for just 2.788M H800 GPU hours with open weights.
Efficient AI · Shanghai Jiao Tong University
Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.
FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.
Efficient AI · Alibaba Qwen Team
RTPurbo converts a trained full-attention LLM into a sparse one with about 600+600 adaptation steps, keeping LongBench accuracy (54.24 vs 53.80) while hitting 9.36x prefill speedup at 1M context.
Efficient AI · Sapient Intelligence
HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.
KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.
World Models · Microsoft Research
Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.
Text-to-Image · Microsoft Research
Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by feeding it longer, denser captions.
LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.
LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.
Efficient AI · Microsoft Research
LoRA freezes a pretrained model and trains tiny low-rank matrices per layer instead — cutting trainable parameters up to 10,000x and GPU memory 3x versus full GPT-3 175B fine-tuning, with no extra latency.
Long Context · Shanghai AI Laboratory
δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods — no fine-tuning, no context extension.
Fine-Tuning & Adaptation · Mind Lab
MinT keeps one frontier base model resident and swaps only LoRA adapters, cutting the model-handoff step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while addressing million-scale adapter catalogs.
Mistral 7B is a 7-billion-parameter open model that outperforms Llama 2 13B on every benchmark tested, uses grouped-query and sliding-window attention for cheap inference, and ships under Apache 2.0.
Retrieval-Augmented Generation · AIRI
OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.
Fine-Tuning & Adaptation · Mind Lab
A position paper that reframes LoRA adapters as persistent personal state rather than a cheap substitute for full fine-tuning, along three axes: scale up the base, scale down the adapter, and scale out to millions. It also describes a serving stack, MinT.
Efficient AI · Microsoft Research
Phi-3-mini is a 3.8B-parameter model trained on 3.3T heavily filtered and synthetic tokens that hits 69% on MMLU and 8.38 on MT-bench — matching Mixtral 8x7B and GPT-3.5 while small enough to run on a phone.
Diffusion Models · University of Science and Technology of China
Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 at 5s and cutting the drift that wrecks 30-second clips.
Mixture of Experts · Google Research
Switch Transformer simplifies Mixture-of-Experts by routing each token to a single expert, hitting up to 7x faster T5 pretraining at fixed compute and scaling to 1.6 trillion parameters with bfloat16 training.
Fine-Tuning & Adaptation · T-Tech
On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero, achieving the best average score on two math settings.
Efficient AI · Stanford University
FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut GPU memory traffic, delivering a 3x training speedup on GPT-2, a 15% end-to-end speedup on BERT-large, and memory that grows linearly with sequence length.
Sequence Modeling · Carnegie Mellon University
Mamba makes state space model parameters depend on the input, so it selectively remembers or forgets tokens. It scales linearly with sequence length, delivers 5x higher inference throughput than Transformers, and Mamba-3B matches Transformers twice its size.
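
The selective mechanism in the Mamba summary above can be illustrated with a toy recurrence. This is a minimal sketch under loud assumptions, not Mamba's actual implementation: it uses a single scalar channel and a single state dimension, and the names `selective_scan`, `w_delta`, `w_b`, and `w_c` are hypothetical stand-ins for learned projections. It shows only the two ideas named in the summary: the step size and gates are computed from the current input, and the scan makes one pass over the sequence, so cost grows linearly with length.

```python
import math


def softplus(z):
    # Smooth positive function; keeps the step size delta > 0.
    # (Overflows for very large z; fine for this toy example.)
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective state space recurrence (illustrative only).

    a        : fixed negative state coefficient (assumed value)
    w_delta, w_b, w_c : hypothetical scalar "projections" of the input
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence: O(len(xs)) work
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # decay in (0, 1): small delta keeps state, large delta forgets
        b_t = w_b * x                  # input-dependent input gate
        c_t = w_c * x                  # input-dependent output gate
        h = a_bar * h + delta * b_t * x
        ys.append(c_t * h)
    return ys
```

In this sketch a large input drives `delta` up and `a_bar` toward zero, so the earlier state is almost entirely overwritten; an input near zero leaves a larger share of the state intact. Real selective SSMs apply the same idea per channel with vector-valued states and a hardware-aware parallel scan.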