Vision-Language-Action
Models that map perception and language directly to robot actions.
Vision-Language-Action · Zhejiang University
LabVLA trains a Qwen3-VL-4B backbone plus a DiT action expert on laboratory workflows and reports 71.1% in-distribution (ID) and 70.0% out-of-distribution (OOD) success on LabUtopia.
World Models · Independent Researcher
"AnchorWorld: Egocentric World Simulation for Embodied AI" turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Robotics · Independent Researcher
"TVRBench: Can Models Move to a Target Viewpoint?" turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Robotics · Tsinghua University
Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x the data of prior work. It reaches 92.58% success in simulation at under 1.5 ms.
AI Agents · Shanghai Jiao Tong University
MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.
Vision-Language-Action · Allen Institute for AI
MolmoAct2 is an open vision-language-action stack that reasons in 3D before acting. On real-world DROID it hits 87.1% success, +38.7 points over the runner-up, and its Molmo2-ER brain beats GPT-5 and Gemini Robotics ER.
Vision-Language-Action · Shanghai AI Laboratory
PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control — lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.
Vision-Language-Action · Alibaba Qwen Team
Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.
Vision-Language-Action · RLWRLD
RLDX-1, from RLWRLD and KAIST, adds motion, memory and tactile streams to a Qwen3-VL backbone. It catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and beats GR00T N1.6 on LIBERO-Plus 86.7% to 72.6%.
Vision-Language-Action · ETH Zurich
A position paper from ETH Zurich, Stanford and TU Darmstadt argues scaling VLA and world models is not enough — robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.
Vision-Language-Action · Physical Intelligence
π0 bolts a flow-matching action expert onto a pretrained VLM, emitting action chunks for ~50 Hz control so one policy can fold laundry, bus tables, and assemble boxes across single-arm, dual-arm, and mobile robots.
Vision-Language-Action · Google DeepMind
RT-2 co-fine-tunes a web-pretrained vision-language model on robot trajectories, expresses actions as text tokens, and shows emergent generalization to novel objects, unseen commands, and basic reasoning across 6k evaluation trials.