Video Generation

Models that synthesize video from text or other conditions, including streaming and autoregressive diffusion approaches.

World Models · Alibaba Qwen Team

ABot-Earth 0.5: Generating 3D Cities From Satellite Images

ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting under 10 minutes per square kilometer and FID 16.1.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench turns complex instruction following in video editing into a checkable test, with concrete failure signals, benchmark limits, and takeaways for builders.

World Models · JD.com (Joy Future Academy)

Echo-Memory: Which Memory Lets a World Model Remember a Room?

When a camera revisits an old spot, block-wise state-space recurrence scored 69.0 on open-domain VLM consistency versus 12.25 for the no-memory baseline; aggressive compression and spatial summaries mostly collapsed.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and takeaways for builders.

Video Generation · Kuaishou Technology

VLM Teachers Score Video-Model Reasoning at Test Time

Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.

Video Generation · HKUST

Echo-Infinity: Learnable Evolving Memory for 24-Hour Real-Time Video

Echo-Infinity is an autoregressive video model with a learnable evolving memory that compresses any-length history at constant cost, hitting 24-hour rollouts (over 1.3M frames) in real time at 18.5 FPS on an H100.

Video Generation · NVIDIA

SANA-Streaming: Real-time Video Editing at 24 FPS on One RTX 5090

SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.

Video Generation · Virginia Tech

VideoMLA: A Low-Rank Latent KV Cache for Minute-Scale Video Diffusion

VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory by 92.7% (224 vs 3,072 scalars), winning VBench at 60s, and lifting B200 throughput by 1.23x.

Diffusion Models · University of Science and Technology of China

Stream-R1: Reliability-Perplexity Aware Reward Distillation Explained

Stream-R1 reweights DMD losses by video reward scores and per-region perplexity instead of treating signals equally. Its 1.3B streaming model hits 84.40 VBench at 23.1 FPS, beating its 14B teacher's 84.26 for free.