Agent Memory · National University of Singapore
EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.
Text-to-Image · The Chinese University of Hong Kong
InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.
Vision-Language-Action · Zhejiang University
LabVLA trains a Qwen3-VL-4B backbone plus DiT action expert on laboratory workflows and reports 71.1% ID and 70.0% OOD success on LabUtopia.
Theorem Proving · MiniMax AI
MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.
Long Context · MiniMax AI
MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.
AI Agents · NVIDIA
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
AI Agents · Google DeepMind
Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.
AI Agents · Renmin University of China
Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.
AI Agents · TokenRhythm Technologies
Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.
Mixture of Experts · Renmin University of China
MPI redesigns MoE routers by aligning router rows with expert weight directions. On an 11B MoE, average benchmark accuracy rises from 40.92 to 42.76 with only 0.2% training slowdown.
Multimodal Models · Kuaishou Technology
Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.
World Models · Alibaba Qwen Team
ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting under 10 minutes per square kilometer and FID 16.1.
World Models · JD.com (Joy Future Academy)
When a camera revisits an old spot, block-wise state-space recurrence scored 69.0 open-domain VLM consistency vs 12.25 for the no-memory baseline; aggressive compression and spatial summaries mostly collapsed.
AI Agents · Independent Researcher
SpatialWorld: Interactive Spatial Reasoning for Agents turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Long Context · Tencent
FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.
World Models · Microsoft Research
Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.
Video Generation · Nanjing University
CoVEBench: Can Video Editors Follow Complex Instructions? turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
World Models · Independent Researcher
AnchorWorld: Egocentric World Simulation for Embodied AI turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · Peking University
A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.
Speech Synthesis · Independent Researcher
MMAE: A Massive Benchmark for Audio Editing Models turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Shanghai Jiao Tong University
SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.
Fine-Tuning & Adaptation · HKUST
On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.
Text Embeddings · Renmin University of China
EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.
AI Agents · Independent Researcher
AdaPlanBench: Testing Adaptive Planning in LLM Agents turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
ArcANE: Measuring When Role-Playing Agents Break Character turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Text-to-Image · Independent Researcher
DIRECT: 3D-Aware Object Insertion with Visual Proxies turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Shanghai Jiao Tong University
A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.
AI Agents · Lehigh University
OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.
AI Agents · Independent Researcher
SoCRATES: Evaluating Proactive LLM Mediation turns proactive mediation agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
ToolMaze: When LLM Agents Must Replan After Tool Failures turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Code Generation · University of Waterloo
Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost — 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.
Vision-Language-Action · ETH Zurich
A position paper from ETH Zurich, Stanford and TU Darmstadt argues scaling VLA and world models is not enough — robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.
AI Agents · UC Berkeley
Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.
Reinforcement Learning · Tsinghua University
CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.
AI Agents · HKUST
StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.
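The scheduling difference can be sketched with Python generators. This is a toy illustration, not StreamMA's implementation: `upstream`, `downstream`, and the event log are hypothetical names, and the "agents" here only tag strings.

```python
from typing import Iterator, List


def upstream(steps: List[str], log: List[str]) -> Iterator[str]:
    """Hypothetical first agent: emits reasoning steps one at a time."""
    for step in steps:
        log.append(f"A wrote {step}")
        yield step


def downstream(stream: Iterator[str], log: List[str]) -> Iterator[str]:
    """Hypothetical second agent: consumes each step as soon as it arrives."""
    for step in stream:
        log.append(f"B read {step}")
        yield f"B({step})"


def run_streaming(steps: List[str]) -> List[str]:
    """Chain the generators so B starts before A has finished."""
    log: List[str] = []
    list(downstream(upstream(steps, log), log))
    return log


def run_batched(steps: List[str]) -> List[str]:
    """Materialize A's full chain first, then hand it to B."""
    log: List[str] = []
    full_chain = list(upstream(steps, log))
    list(downstream(iter(full_chain), log))
    return log
```

In the streaming run the log interleaves writes and reads; in the batched run every write precedes the first read.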
AI Agents · Independent Researcher
TIDE: Proactive Multi-Problem Discovery with Templates turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · Independent Researcher
VideoKR: Knowledge-Intensive Video Understanding turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Video Generation · HKUST
Echo-Infinity is an autoregressive video model with a learnable evolving memory that compresses any-length history at constant cost, hitting 24-hour rollouts (over 1.3M frames) in real time at 18.5 FPS on an H100.
Multimodal Models · Skywork AI
The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly — trained on StreamAudio-2M and competitive across 8 benchmarks.
Biomolecular Modeling · AIRI
GENEB probes frozen representations from 40 genomic foundation models across 100 tasks in 13 functional categories, and finds rankings flip across categories while extra parameters buy only modest, inconsistent gains.
Multimodal Models · Shanghai AI Laboratory
OVO-S-Bench: Streaming Spatial Intelligence in MLLMs turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
LLM Reasoning · Shanghai AI Laboratory
ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.
World Models · University of Macau
PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.
Robotics · Tsinghua University
Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x the data of prior work. It reports 92.58% sim success at under 1.5 ms.
Language Models · Google Research
Google Research argues LLMs need an offline sleep phase to turn short-term context into stable weights. With sleep, Qwen3-8B hits 79.2% on AIME-24 and a Transformer reaches 80% on ARC few-shot, beating SEAL.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.
Multimodal Models · University of Washington
Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text — lifting multiview counting 3.4%, rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.
Efficient AI · Huawei
KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.
AI Agents · Nanjing University
TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.
AI Agents · University of Illinois Urbana-Champaign
Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.
AI Agents · Independent Researcher
K-BrowseComp: Korean Web-Browsing Agent Benchmark turns Korean-context web browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Reinforcement Learning · Tianjin University
When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.
Video Generation · Kuaishou Technology
Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.
Multimodal Models · The Chinese University of Hong Kong
X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.
Multimodal Models · NVIDIA
Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.
Fine-Tuning & Adaptation · Mind Lab
A position paper reframing LoRA adapters as persistent personal state, not a cheap full-finetune substitute, across three axes: scale up the base, scale down the adapter, scale out to millions, plus a serving stack called MinT.
AI Agents · Ant Group
SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.
Robotics · Independent Researcher
TVRBench: Can Models Move to a Target Viewpoint? turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
LLM Reasoning · Samsung Research
TrOPD masks on-policy distillation to the tokens where the teacher is actually trustworthy, adding +3.06 to +3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.
Retrieval-Augmented Generation · AIRI
OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.
World Models · Independent Researcher
Function2Scene: 3D Indoor Layout from Functional Specs turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Diffusion Models · The Hong Kong Polytechnic University
GGT-100K: Generative Ground Truth for Image Restoration turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Long Context · Tsinghua University
Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).
AI Agents · Independent Researcher
When Masking Stale Observations Helps Search Agents turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Reinforcement Learning · University of Edinburgh
SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.
Speech Synthesis · Zhejiang University
SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21s while cutting Fréchet Distance to 120.28 vs OmniAudio's 157.67. Quality without waiting for the whole clip.
Agent Memory · ByteDance
TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.
Mixture of Experts · National University of Singapore
dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.
AI Agents · Shanghai AI Laboratory
COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.
Code Generation · JetBrains
Mellum 2 is JetBrains' open-weight 12B Mixture-of-Experts code model that activates only 2.5B parameters per token, matching dense 4B-14B baselines on software tasks at a fraction of the per-token compute.
Multimodal Models · ByteDance
Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.
Speech Synthesis · ByteDance
SwanVoice is a zero-shot TTS system that generates an entire 1-4 speaker conversation in one pass, keeping voice, mood, and prosody consistent across turns where turn-by-turn synthesis drifts — but content accuracy lags.
Fine-Tuning & Adaptation · T-Tech
On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero — best average on two math settings.
Language Models · Independent Researcher
Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from 5-300 to below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.
AI Agents · Shanghai AI Laboratory
ResearchClawBench: Testing Autonomous Research Agents turns end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Xiamen University
SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).
Efficient AI · Shanghai AI Laboratory
Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x speedup, beating EAGLE-3 by 23% and DFlash by 13%.
Video Generation · NVIDIA
SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.
Video Generation · Virginia Tech
VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory 92.7% (224 vs 3,072 scalars), winning VBench at 60s, and lifting B200 throughput 1.23x.
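The 92.7% figure follows directly from the two scalar counts quoted above. A quick check, using only those two numbers:

```python
# Per-token KV cache size in scalars, as quoted in the summary above.
baseline_scalars = 3072
mla_scalars = 224

# Fraction of per-token KV memory removed, and the equivalent compression ratio.
reduction_pct = (1 - mla_scalars / baseline_scalars) * 100
compression_ratio = baseline_scalars / mla_scalars

print(f"{reduction_pct:.1f}% smaller, {compression_ratio:.1f}x compression")
```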
Multimodal Models · Meta AI
VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.
AI Agents · Shanghai AI Laboratory
AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.
Multimodal Models · University of Illinois Urbana-Champaign
Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.
Efficient AI · Shanghai Jiao Tong University
Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.
Retrieval-Augmented Generation · University of Massachusetts Amherst
GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.
Vision-Language-Action · Alibaba Qwen Team
Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.
Speech Synthesis · Independent Researcher
A Broad Benchmark for Long-Form Speech Generation turns long-form speech generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
TASTE: Harder Agent Benchmarks from Tool Sequences turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
World Models · NVIDIA
Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.
Fine-Tuning & Adaptation · The Hong Kong Polytechnic University
Teachability-Aware OPD supervises only ~5% of tokens, those where the teacher's correction lands inside the student's top-K support, matching or beating full-token distillation (44.89 vs 42.37 when distilling Qwen3-4B into a 1.7B student).
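One way to read the token-selection rule, as a minimal sketch: keep a position only if the teacher's preferred token is already among the student's K most likely tokens. The function name, the use of a single teacher token id as its "correction", and the toy distributions are all assumptions for illustration, not the paper's definition.

```python
from typing import List, Sequence


def teachable_mask(
    student_probs: Sequence[Sequence[float]],
    teacher_tokens: Sequence[int],
    k: int,
) -> List[bool]:
    """Return True at positions where the teacher's token is in the student's top-k.

    Assumption: the teacher's "correction" is a single token id per position.
    """
    mask = []
    for probs, teacher_tok in zip(student_probs, teacher_tokens):
        # Indices of the k highest-probability student tokens at this position.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        mask.append(teacher_tok in top_k)
    return mask


# Toy example: 3 positions, vocabulary of 4 tokens.
student = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.10, 0.40, 0.40],
    [0.10, 0.20, 0.30, 0.40],
]
teacher = [1, 0, 3]
mask = teachable_mask(student, teacher, k=2)
```

With `k=2`, the second position is dropped because the teacher's token sits outside the student's top two; only the kept positions would contribute to the distillation loss.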
Multimodal Models · NVIDIA
LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.
LLM Reasoning · Alibaba Qwen Team
DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.
World Models · Fudan University
WBench scores interactive video world models on five axes — quality, setting, interaction, consistency, physics — across 289 cases and 1,058 turns, and finds no single model wins on all five.
Language Models · Xiaohongshu
NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.
Brain Decoding · MIT
BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.
AI Agents · Microsoft Research
SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.
Theorem Proving · Google DeepMind
This work evaluates AI-aided formal proof search on open math problems: the strongest agent resolves 9 of 353 Erdős problems and proves 44 of 492 OEIS conjectures.
Multimodal Models · The University of Tokyo
MM-OCEAN tests whether multimodal LLMs justify Big Five personality ratings with real video evidence. Across 27 models, 51.3% of correct ratings rest on wrong cues, and the best grounds only 33.5% fully.
Language Models · Alibaba Qwen Team
TransitLM is a 13M-record corpus from four Chinese cities (120,845 stations) that trains a language model to plan transit routes with no map engine — a 4B model hits 97.0% connectivity and 71.0% exact match.
LLM Reasoning · Renmin University of China
DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.
Efficient AI · Sapient Intelligence
HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.