Long Context

Models and evaluations for reasoning over very large text, audio, video, or code contexts.

Agent Memory · National University of Singapore

EvoArena: Why Agent Memory Must Track Environment Changes

EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.

Multimodal Models · Kuaishou Technology

Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model

Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.

Long Context · MiniMax AI

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
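
The idea of attending to only a fixed budget of selected KV tokens can be illustrated with a minimal top-k sketch. This is not MiniMax's implementation: the selection rule (raw dot-product score), the function name `topk_attention`, and the toy sizes are all assumptions made for illustration, and the usual scaling by the square root of the head dimension is omitted. A real system would need a cheap selector rather than scoring every key as done here.

```python
import heapq
import math

def topk_attention(q, keys, values, k):
    """Attend over only the k highest-scoring keys (hypothetical selection rule)."""
    # Score every key against the query (a real selector would avoid this full pass).
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Keep the indices of the k best-scoring keys.
    keep = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
    # Numerically stable softmax restricted to the kept keys.
    top = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(weights.values())
    dim = len(values[0])
    return [sum(weights[i] * values[i][d] for i in keep) / total for d in range(dim)]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
sparse_out = topk_attention(q, keys, values, k=1)   # only the best key is used
dense_out = topk_attention(q, keys, values, k=3)    # k = all keys recovers dense attention
```

With a budget of 2,048 tokens against a 1M-token context, the softmax and value mixing touch roughly 0.2% of the cache per query group, which is where the FLOP savings would come from.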

Long Context · Tsinghua University

LongTraceRL: Harder Distractors and Rubric Rewards for Long-Context RL

Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

This study frames context management for search agents as a checkable test: when masking stale observations helps, which failure signals to watch for, where the benchmarks fall short, and what builders should take away.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench turns streaming spatial intelligence in multimodal LLMs into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Long Context · Tencent

FlashMemory-DeepSeek-V4: Cutting KV Cache to 13.5% for 500K Context

FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.

Efficient AI · Alibaba Qwen Team

Full Attention Strikes Back: RTPurbo Sparsifies LLMs in Hundreds of Steps

RTPurbo converts a trained full-attention LLM into a sparse one with about 600+600 adaptation steps, keeping LongBench accuracy (54.24 vs 53.80) while hitting 9.36x prefill speedup at 1M context.

Long Context · University of Illinois Urbana-Champaign

From Context to Skills: Ctx2Skill Self-Evolves Context Learning

Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.

Efficient AI · Huawei

KVarN: 2-Bit KV-Cache Quantization Without Calibration

KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.

Long Context · Shanghai AI Laboratory

δ-mem: An 8×8 Online Memory That Boosts Frozen LLMs

δ-mem bolts a tiny 8×8 delta-rule memory onto a frozen LLM and lifts average long-memory scores 1.10× over the backbone and 1.15× over other memory methods, with no fine-tuning and no context extension.
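
For readers unfamiliar with the delta rule, here is a minimal sketch of the generic update on an 8×8 memory matrix: M ← M + β (v − M k) kᵀ. How δ-mem actually derives its keys, values, and write strength, and how the memory feeds back into the frozen LLM, is not stated above, so the helper names `matvec` and `delta_write`, the fixed `beta`, and the toy key and value are all assumptions.

```python
def matvec(M, x):
    """Read from the memory: return M @ x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def delta_write(M, k, v, beta=1.0):
    """Generic delta-rule write: M <- M + beta * (v - M k) k^T."""
    pred = matvec(M, k)  # what the memory currently returns for key k
    return [[m_ij + beta * (v_i - p_i) * k_j for m_ij, k_j in zip(row, k)]
            for row, v_i, p_i in zip(M, v, pred)]

D = 8                                   # 8x8 memory, matching the size in the blurb
memory = [[0.0] * D for _ in range(D)]
key = [1.0] + [0.0] * (D - 1)           # toy unit-norm key (assumption)
value = [float(i) for i in range(D)]    # toy value to store (assumption)

memory = delta_write(memory, key, value)
recalled = matvec(memory, key)          # reading with the same key returns the value

# Writing a new value under the same key overwrites the old one instead of adding to it.
memory = delta_write(memory, key, [1.0] * D)
overwritten = matvec(memory, key)
```

The error-correcting term (v − M k) is what separates the delta rule from a plain additive outer-product memory: with a unit-norm key and β = 1, a second write replaces the stored value rather than piling on top of it.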

Long Context · Google DeepMind

Gemini 1.5: Near-Perfect Recall Across Millions of Tokens

Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio, and Pro matches Gemini 1.0 Ultra with far less training compute.

Sequence Modeling · Carnegie Mellon University

Mamba: Selective State Spaces for Linear-Time Sequence Modeling

Mamba makes state space model parameters depend on the input, so it can selectively remember or forget tokens. It scales linearly with sequence length, delivers 5x higher inference throughput than Transformers, and Mamba-3B matches Transformers twice its size.
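
The input-dependent recurrence can be sketched as a toy scalar-state scan. This is a deliberately simplified illustration, not Mamba's architecture or its hardware-aware kernel: the scalar state, the weights `w_delta`, `w_b`, `w_c`, and the simplified discretization are all assumptions.

```python
import math

def softplus(z):
    """Smooth positive function used here to keep the step size > 0."""
    return math.log1p(math.exp(z))

def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0, a=-1.0):
    """Toy selective SSM: step size, B, and C are all functions of the input."""
    h = 0.0
    ys = []
    for x in xs:                       # one pass, so cost is linear in len(xs)
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # decay in (0, 1): large delta forgets more
        b_t = w_b * x                  # input-dependent input projection
        c_t = w_c * x                  # input-dependent output projection
        h = a_bar * h + delta * b_t * x
        ys.append(c_t * h)
    return ys

outputs = selective_scan([1.0, -2.0, 0.5])
```

Because `delta` depends on the token, each input controls how much of the previous state survives and how strongly the new token is written, which is the "selective" behavior the blurb refers to.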