MiniMax Sparse Attention: 1M Context Without Dense Attention
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
Institution
NVIDIA's research arm, known for accelerated computing and AI work spanning generative models, world models, and robotics for Physical AI.
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.
AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop consistency-distilled models suffer at higher step counts. Tested on Wan2.1 from 1.3B to 14B.
Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.
Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.
LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.
LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.
MulTaBench is a 40-dataset benchmark (20 image-tabular, 20 text-tabular) where each task needs both the table and the image or text. Its finding: tuning embeddings to the target beats frozen embeddings on every learner.