MiniMax Sparse Attention: 1M Context Without Dense Attention
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
Topics
Attention-based architectures that became the backbone of modern language and multimodal models.
Mask2Former uses masked attention to unify semantic, instance, and panoptic segmentation, reaching 57.8 PQ on COCO panoptic and 57.7 mIoU on ADE20K.
NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.
Language Models · Google Research
BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra layer — pushing GLUE to 80.5% and topping 11 NLP tasks.
Language Models · Google DeepMind
DeepMind's Chinchilla shows that, for compute-optimal training, model size and training tokens should scale in equal proportion. A 70B model on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.
Efficient AI · Shanghai Jiao Tong University
Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.
FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.
Efficient AI · Alibaba Qwen Team
RTPurbo converts a trained full-attention LLM into a sparse one with about 600+600 adaptation steps, keeping LongBench accuracy (54.24 vs 53.80) while hitting 9.36x prefill speedup at 1M context.
GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.
Diffusion Models · Independent Researcher
Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.
Language Models · Google Research
PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It hit breakthrough few-shot results and beat average human scores on BIG-bench.
Diffusion Models · Alibaba Qwen Team
DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.
Language Models · Google Research
T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.
Text Embeddings · Renmin University of China
EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.
Vision Foundation Models · Google Research
ViT shows a plain Transformer fed raw 16x16 image patches beats top CNNs once pre-trained on JFT-300M, reaching 88.55% top-1 accuracy on ImageNet while using far less training compute.
Transformers · Google Research
The 2017 Transformer dropped recurrence and convolution for pure attention, hit 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR, and trained in 3.5 days on 8 GPUs. Nearly every modern LLM inherits it.
Efficient AI · Stanford University
FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut GPU memory traffic, delivering a 3x training speedup on GPT-2, a 15% end-to-end speedup on BERT-large, and memory that grows linearly with sequence length.