Institution

NVIDIA

NVIDIA's research arm, known for accelerated computing and AI work spanning generative models, world models, and robotics for Physical AI.

Long Context · MiniMax AI

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs and a 14.2x prefill speedup at 1M context.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

Video Generation · NVIDIA

SANA-Streaming: Real-time Video Editing at 24 FPS on One RTX 5090

SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.

Diffusion Models · NVIDIA

AnyFlow: Any-Step Video Diffusion via Flow Map Distillation

AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop that consistency-distilled models suffer at higher step counts. It is tested on Wan2.1 models from 1.3B to 14B.

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

Diffusion Models · NVIDIA

LongLive-2.0: NVFP4 4-bit Training and Inference for Long Video

LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit. It reaches 45.7 FPS at 720p with 2.1x faster training and 1.84x faster inference, while the VBench total drops only ~0.5 points from BF16.

Multimodal Models · NVIDIA

LocateAnything: Parallel Box Decoding for Faster Vision Grounding

LocateAnything emits each bounding box in a single decoding step instead of digit by digit, hitting 12.7 boxes/sec in hybrid mode (about 2.5x faster than Rex-Omni-3B) while leading on COCO and LVIS at the same 3B size.

Multimodal Models · NVIDIA

MulTaBench: A 40-Dataset Benchmark for Multimodal Tabular Learning

MulTaBench is a 40-dataset benchmark (20 image-tabular, 20 text-tabular) where each task needs both the table and the image or text. Its finding: tuning embeddings to the target beats frozen embeddings on every learner.