Transformers

Attention-based architectures that became the backbone of modern language and multimodal models.

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs and a 14.2x prefill speedup at 1M context.

Segmentation · Meta AI

Mask2Former: One Transformer for Segmentation Tasks

Mask2Former uses masked attention to unify semantic, instance, and panoptic segmentation, reaching 57.8 PQ on COCO panoptic and 57.7 mIoU on ADE20K.

Language Models · Xiaohongshu

NITP: Predict the Next Token's Meaning, Not Just Its ID

NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.

Language Models · Google Research

BERT Explained: Bidirectional Transformer Pretraining for NLP

BERT pretrains a deep bidirectional Transformer encoder with masked language modeling, then fine-tunes with one extra output layer, pushing the GLUE score to 80.5% and setting state-of-the-art results on 11 NLP tasks.

Language Models · Google DeepMind

Chinchilla: Why Compute-Optimal LLMs Beat Bigger Ones

DeepMind's Chinchilla shows that model size and training tokens should be scaled in equal proportion as the compute budget grows. A 70B model trained on ~1.4T tokens beats Gopher 280B, GPT-3 175B, and MT-NLG 530B.
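As a rough worked example of what equal scaling implies, the sketch below computes the tokens-per-parameter ratio from the 70B and ~1.4T figures above and shows how both quantities grow when the compute budget is multiplied. The `C ≈ 6 * N * D` training-cost estimate is a common rule of thumb assumed here for illustration, and the function names are hypothetical.

```python
import math

# Figures quoted in the summary above.
CHINCHILLA_PARAMS = 70e9     # 70B parameters
CHINCHILLA_TOKENS = 1.4e12   # ~1.4T training tokens


def training_flops(params, tokens):
    """Approximate training compute with the rule of thumb C ~ 6 * N * D.

    This estimate is an assumption for illustration, not a figure from the summary.
    """
    return 6.0 * params * tokens


def scale_equally(params, tokens, compute_multiplier):
    """Grow parameters and tokens by the same factor for a k-times compute budget.

    Under C ~ 6 * N * D, multiplying both N and D by sqrt(k) multiplies C by k.
    """
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


tokens_per_param = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS
bigger_params, bigger_tokens = scale_equally(CHINCHILLA_PARAMS, CHINCHILLA_TOKENS, 4.0)

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"4x compute -> {bigger_params / 1e9:.0f}B params, {bigger_tokens / 1e12:.1f}T tokens")
```

With these numbers the ratio comes out to about 20 tokens per parameter, and a 4x compute budget doubles both the parameter count and the token count.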

Efficient AI · Shanghai Jiao Tong University

Domino: Splitting the Draft and the Causal Fix in Speculative Decoding

Domino lets a parallel drafter propose a whole block at once, then has a lightweight head add back the token-to-token dependencies, reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.

Long Context · Tencent

FlashMemory-DeepSeek-V4: Cutting KV Cache to 13.5% for 500K Context

FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.

Efficient AI · Alibaba Qwen Team

Full Attention Strikes Back: RTPurbo Sparsifies LLMs in Hundreds of Steps

RTPurbo converts a trained full-attention LLM into a sparse one with about 600+600 adaptation steps, keeping LongBench accuracy (54.24 vs 53.80) while hitting 9.36x prefill speedup at 1M context.

Language Models · OpenAI

GPT-3 Explained: When the Prompt Became the Programming Interface

GPT-3 is a 175B-parameter autoregressive language model that performs translation, QA, and reasoning tasks from a few in-prompt examples, with no gradient updates or task-specific fine-tuning.

Diffusion Models · Independent Researcher

Mean Mode Screaming: Stabilizing 1000-Layer Diffusion Transformers

Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.

Language Models · Google Research

PaLM: Scaling a 540B Dense Language Model with Pathways

PaLM is a 540-billion-parameter dense Transformer trained on 6,144 TPU v4 chips with Pathways. It hit breakthrough few-shot results and beat average human scores on BIG-bench.

Diffusion Models · Alibaba Qwen Team

Rethinking Cross-Layer Information Routing in Diffusion Transformers

DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.

Language Models · Google Research

T5 Explained: One Text-to-Text Interface for Every NLP Task

T5 reframes every NLP task as text-in, text-out, then runs a systematic sweep over objectives, architectures, data, and scale. The 11B model set state of the art on GLUE, SuperGLUE, and SQuAD.

Text Embeddings · Renmin University of China

EmbFilter: Turning an LLM's UnEmbedding Matrix Into a Feature Lens

EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.

Vision Foundation Models · Google Research

Vision Transformer (ViT): An Image is Worth 16x16 Words

ViT shows that a plain Transformer fed raw 16x16 image patches beats top CNNs once pre-trained on JFT-300M, reaching 88.55% top-1 accuracy on ImageNet while using far less training compute.
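To make the "image patches as words" idea concrete, here is a minimal sketch of cutting an image into non-overlapping 16x16 patches and flattening each one into a vector. The 224x224 RGB input size and the `patchify` helper are assumptions for illustration; the real model also applies a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch_size=16):
    """Split an H x W image of (r, g, b) pixels into flattened square patches.

    Returns a list of patches in row-major order; each patch is a flat list of
    patch_size * patch_size * 3 channel values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append the 3 channel values
            patches.append(flat)
    return patches


# Assumed input: a blank 224x224 RGB image.
image = [[(0, 0, 0)] * 224 for _ in range(224)]
patches = patchify(image)
print(len(patches), len(patches[0]))
```

Under these assumptions the image becomes a sequence of 196 patch vectors (a 14x14 grid), each of length 768, which plays the role of the token sequence a text Transformer would receive.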

Transformers · Google Research

Attention Is All You Need: The Transformer Architecture Explained

The 2017 Transformer dropped recurrence and convolution for pure attention, hit 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR, and trained in 3.5 days on 8 GPUs. Nearly every modern LLM inherits it.
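The core operation behind "pure attention" is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below is a plain-Python, single-head version without masking, multiple heads, or learned projections; the function names are illustrative and not taken from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
# A zero query scores every key equally, so the output is the mean of the values.
print(scaled_dot_product_attention([[0.0, 0.0]], keys, values))
```

Every output position is computed independently from the same keys and values, which is what lets the architecture process a sequence in parallel instead of step by step as a recurrent model would.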

Efficient AI · Stanford University

FlashAttention Explained: IO-Aware Exact Attention, 2-4x Faster

FlashAttention is an exact attention algorithm that uses tiling and recomputation to cut GPU memory traffic, delivering a 3x training speedup on GPT-2, a 15% end-to-end speedup on BERT-large, and memory that grows linearly with sequence length.
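The tiling idea can be sketched on the CPU for a single query: process keys and values one block at a time while keeping a running maximum, a running softmax denominator, and a running weighted sum, so the full score row is never stored. This is a simplified illustration under those assumptions, not the GPU kernel; it omits recomputation in the backward pass, and the function names and `block_size` parameter are hypothetical.

```python
import math


def attention_reference(q, keys, values):
    """Standard softmax attention for one query, materializing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, computed block by block with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9], [0.4, 0.4], [2.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values))
```

Both functions return the same vector up to floating-point rounding, which is the sense in which the tiled computation is exact rather than an approximation; the working state per query is a few scalars plus one output vector, regardless of how many keys there are.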