Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model

Quick answer

Kwai Keye-VL-2.0-30B-A3B is an open-source multimodal mixture-of-experts (MoE) model aimed at long-video understanding and agentic tasks. It activates about 3B parameters, adapts DeepSeek Sparse Attention to a multimodal architecture built on grouped-query attention (GQA), and is trained for 256K context. The paper’s strongest evidence is on long-video understanding and temporal grounding: Keye-VL-2.0 reports 74.1 on LongVideoBench and leads all three TimeLens subsets in the comparison table.

What the model is trying to solve

Hour-level video creates two problems at once. The model must find the right frames inside a long stream, and it must reason across events that may be far apart. Dense attention makes the context expensive, while frame sampling can miss the evidence. Keye-VL-2.0’s answer is a sparse long-context multimodal stack plus a staged training and RL pipeline.
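To make the frame-sampling risk concrete, here is a minimal Python sketch. It assumes uniform sampling at the center of equal-length segments, which is one common strategy and not necessarily what Keye-VL-2.0 uses. The function names and the 10-second event are hypothetical and chosen only for illustration.

```python
def uniform_frame_times(duration_s: float, num_frames: int) -> list[float]:
    """Timestamps (seconds) of frames taken at the center of equal-length segments."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    segment = duration_s / num_frames
    return [segment * (i + 0.5) for i in range(num_frames)]


def event_is_sampled(times: list[float], start_s: float, end_s: float) -> bool:
    """True if at least one sampled timestamp falls inside [start_s, end_s]."""
    return any(start_s <= t <= end_s for t in times)


hour = 3600.0
sparse = uniform_frame_times(hour, 64)   # one frame every 56.25 s
dense = uniform_frame_times(hour, 512)   # one frame every ~7.03 s

# Hypothetical 10-second event starting at the 30-minute mark.
event = (1800.0, 1810.0)
print(event_is_sampled(sparse, *event))  # False: the event falls between two samples
print(event_is_sampled(dense, *event))   # True: a sample lands at ~1803.5 s
```

With 64 frames over an hour, any event shorter than the 56.25-second spacing can fall entirely between samples. Denser sampling closes that gap but multiplies the number of visual tokens, which is the cost that sparse attention is meant to contain.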

The architecture is a 30B-class MoE with only about 3B active parameters. That matters for deployment: the paper is not presenting a giant closed model, but an open model meant to run at a controlled cost.

Sparse attention and training recipe

The core technical move is adapting DeepSeek Sparse Attention (DSA) to a GQA-based multimodal model. The paper pairs this with Chunk ViT processing, ViT-LM heterogeneous parallelism, custom DSA kernels, and decode optimizations. For 128K context, the report says DSA-specific optimization reduces prefill cost by over 3x and decode cost by over 5x compared with full attention.
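The general idea behind top-k sparse attention can be sketched in a few lines of Python. This is a single-query, single-head toy under stated assumptions, not the paper's DSA kernel: the function `attend`, its `top_k` parameter, and the example vectors are hypothetical, and the selector here simply reuses the attention scores instead of a cheaper dedicated scorer.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, top_k=None):
    """Single-query attention. If top_k is set, only the top_k best-scoring keys are used.

    Returns (output_vector, indices_of_attended_keys).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    if top_k is None or top_k >= len(keys):
        selected = list(range(len(keys)))  # dense attention
    else:
        # Assumption: the selector reuses the full scores, so this toy saves no
        # compute. A practical design would rank keys with a much cheaper scorer.
        ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
        selected = sorted(ranked[:top_k])
    weights = softmax([scores[i] for i in selected])
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d, v in enumerate(values[i]):
            out[d] += w * v
    return out, selected


query = [1.0, 1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]

sparse_out, sparse_idx = attend(query, keys, values, top_k=1)
dense_out, dense_idx = attend(query, keys, values)
print(sparse_idx, sparse_out)  # [2] [2.0, 2.0]
print(dense_idx)               # [0, 1, 2, 3]
```

The softmax and the weighted sum run over `top_k` entries instead of the full context, which is where the savings come from once selection itself is cheap. How the paper shares selected tokens across GQA query groups is not covered by this sketch.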

Post-training is also central. Cross-Modal Multi-Teacher On-Policy Distillation tries to consolidate feedback from multiple teachers without catastrophic forgetting. Context-RL and Video-RL then target long-context retrieval, temporal grounding, and agentic behavior.
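As a rough illustration of what on-policy distillation with several teachers can look like, the sketch below uses one common formulation: the student samples its own trajectory, and each step is penalized by the reverse KL divergence to a teacher chosen by the prompt's domain. This is an assumption for illustration only. The paper's actual objective, teacher routing, and weighting may differ, and every name here (`on_policy_distill_loss`, the `"video"` and `"text"` teachers, the toy distributions) is hypothetical.

```python
import math
import random


def reverse_kl(student: list[float], teacher: list[float]) -> float:
    """KL(student || teacher) for one next-token distribution.

    Assumes teacher probabilities are non-zero wherever the student's are.
    """
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0)


def sample_token(probs: list[float], rng: random.Random) -> int:
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def on_policy_distill_loss(student_step, teachers, prompt_domain, steps, rng):
    """Mean per-token reverse KL along a trajectory sampled from the student."""
    teacher_step = teachers[prompt_domain]  # route to the teacher for this domain
    prefix: list[int] = []
    total = 0.0
    for _ in range(steps):
        p_student = student_step(prefix)
        p_teacher = teacher_step(prefix)
        total += reverse_kl(p_student, p_teacher)
        # On-policy: the next token comes from the student, not from the teacher.
        prefix.append(sample_token(p_student, rng))
    return total / steps


# Toy next-token models over a 3-token vocabulary; they ignore the prefix.
def student_step(prefix):
    return [0.5, 0.3, 0.2]


teachers = {
    "video": lambda prefix: [0.5, 0.3, 0.2],  # agrees with the student
    "text": lambda prefix: [0.2, 0.3, 0.5],   # disagrees with the student
}

rng = random.Random(0)
video_loss = on_policy_distill_loss(student_step, teachers, "video", steps=8, rng=rng)
text_loss = on_policy_distill_loss(student_step, teachers, "text", steps=8, rng=rng)
print(video_loss)      # 0.0: student already matches this teacher
print(text_loss > 0)   # True: there is a gap left to distill
```

Because the loss is computed on the student's own samples, each teacher only corrects behavior the student actually exhibits in that domain, which is one intuition for why this style of distillation is associated with less forgetting.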

That combination is the paper’s real claim. Long-video models often lose general reasoning when optimized for perception, or lose temporal precision when optimized for broad instruction following. Keye-VL-2.0 tries to keep those objectives in one model by separating sparse long-context infrastructure from staged alignment and RL. The reported results make most sense when read through that systems lens.

Key results

Long video: Keye-VL-2.0 reaches 74.1 on LongVideoBench, above Qwen3-VL-235B-A22B Thinking at 70.5 in the table.
Video-MME-v2: it reports 35.3 accuracy at 64 frames and 42.4 at 512 frames, a 7.1-point gain from denser visual context.
TimeLens: it leads ActivityNet-TimeLens at 58.5, QVHighlights-TimeLens at 70.1, and Charades-TimeLens at 58.4.
Coding: it reports 64.2 on LiveCodeBench v6, 71.5 on OJBench, and 62.0 on SWE-bench Verified.
Tool use: it reports 82.6 on tau2-Bench and 33.1 on VitaBench, leading the listed comparison models on those rows.
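
The TimeLens rows above are temporal grounding results, where a model predicts the start and end of a described event. Such benchmarks are commonly scored with temporal intersection over union (tIoU), either averaged or thresholded into recall. This summary does not state which variant the quoted TimeLens numbers use, so the Python sketch below is a generic illustration with hypothetical function names and example segments.

```python
def temporal_iou(pred: tuple[float, float], truth: tuple[float, float]) -> float:
    """Intersection over union of two time segments given as (start_s, end_s)."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(pairs, threshold: float) -> float:
    """Fraction of (prediction, ground truth) pairs whose tIoU meets the threshold."""
    hits = sum(1 for pred, truth in pairs if temporal_iou(pred, truth) >= threshold)
    return hits / len(pairs)


# Hypothetical predictions paired with ground-truth segments, in seconds.
pairs = [
    ((0.0, 10.0), (0.0, 10.0)),    # exact match, tIoU = 1.0
    ((10.0, 20.0), (15.0, 25.0)),  # half overlap, tIoU = 5 / 15
    ((0.0, 5.0), (50.0, 60.0)),    # disjoint, tIoU = 0.0
]
print(recall_at_iou(pairs, 0.5))   # one of three pairs clears tIoU >= 0.5
```

In an hour-long video, a prediction that is off by a few seconds can still score well against a long event but drops to zero against a short one, which is why temporal grounding is a sharper test than question answering over sampled frames.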

What to believe, and what to discount

The credible part is the broad systems story: sparse long-context attention, video curriculum, temporal grounding data, and multimodal RL line up with the measured strengths. The paper is also unusually explicit about inference cost optimizations, which is important for hour-level video.

The part to read carefully is the giant benchmark surface. Technical reports with many tables can hide uneven weaknesses. Keye-VL-2.0 is strong on long video and temporal localization, but it is not uniformly best on every multimodal benchmark. On Video-MME and MLVU, closed or larger open baselines still score in the same range.

Limits and open questions

The model is open, but reproducibility depends on the released checkpoints, inference stack, frame sampling, sparse kernels, and long-context serving choices. A user running only the checkpoint may not reproduce the cost profile.
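
One practical way to handle this is to record the settings that affect both accuracy and cost alongside every run, then diff them against the configuration you are trying to reproduce. The sketch below is a hypothetical record, not part of any released Keye-VL tooling. The field names, the placeholder checkpoint id, and the values are all assumptions, including the choice to write 256K as 262144 tokens.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunConfig:
    """Hypothetical record of settings that can change reported scores or cost."""
    checkpoint: str
    num_frames: int
    max_context_tokens: int
    attention_mode: str   # e.g. "sparse" or "dense" (illustrative labels)
    serving_stack: str


def config_diff(a: EvalRunConfig, b: EvalRunConfig) -> dict:
    """Fields that differ between two runs, mapped to (a_value, b_value)."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Placeholder values; substitute whatever the report and your own setup specify.
target = EvalRunConfig(
    checkpoint="<released-checkpoint>",
    num_frames=512,
    max_context_tokens=262144,  # assumes 256K means 256 * 1024 tokens
    attention_mode="sparse",
    serving_stack="<reference-stack>",
)
local = EvalRunConfig(
    checkpoint="<released-checkpoint>",
    num_frames=64,
    max_context_tokens=262144,
    attention_mode="dense",
    serving_stack="<reference-stack>",
)
print(config_diff(target, local))
```

Here the diff reports `num_frames` and `attention_mode`, the two settings most likely to explain a gap in both accuracy and latency relative to the paper's numbers.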

A second question is agentic transfer. SWE-bench Verified and tool-use numbers are valuable, but the model is primarily positioned as a multimodal foundation model. Real multimodal agents still depend on tool routing, memory, UI control, and verification outside the base model.

There is also a data transparency limit. The report describes dense captions, temporal grounding data, Context-RL data, and agent data, but users cannot fully audit whether the same mixture will be available or reproducible outside Kuaishou’s training pipeline.

FAQ

What is Kwai Keye-VL-2.0?

Kwai Keye-VL-2.0-30B-A3B is an open-source multimodal MoE model from Kuaishou. It targets long-video understanding, temporal grounding, coding, tool use, and multimodal agent tasks.

How long is Keye-VL-2.0’s context window?

Keye-VL-2.0 is trained for 256K context. It uses DeepSeek Sparse Attention, adapted to a GQA-based multimodal architecture, to make long-video processing cheaper.

What are Keye-VL-2.0’s strongest benchmark results?

The strongest evidence is on long-video understanding and temporal grounding: 74.1 on LongVideoBench, 58.5 on ActivityNet-TimeLens, 70.1 on QVHighlights-TimeLens, and 58.4 on Charades-TimeLens.

Is Keye-VL-2.0 better than Qwen3-VL?

It beats Qwen3-VL variants on several long-video and temporal grounding rows, but not every multimodal row. The fair claim is narrower: Keye-VL-2.0 is especially competitive for long-video and temporal localization at its active-parameter scale.

One line: Keye-VL-2.0 is worth covering because it connects sparse long-context engineering to open multimodal video agents, not just to text windows. Read the original paper on arXiv.