Robotics

Learning and control for physical robots.

ABot-Earth 0.5: Generating 3D Cities From Satellite Images

ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting generation in under 10 minutes per square kilometer and an FID of 16.1.

Vision-Language-Action · Zhejiang University

LabVLA: A VLA Model for Scientific Lab Robots

LabVLA trains a Qwen3-VL-4B backbone plus a DiT action expert on laboratory workflows and reports 71.1% in-distribution (ID) and 70.0% out-of-distribution (OOD) success on LabUtopia.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · Independent Researcher

Function2Scene: 3D Indoor Layout from Functional Specs

Function2Scene turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Robotics · Independent Researcher

TVRBench: Can Models Move to a Target Viewpoint?

TVRBench turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Robotics · Tsinghua University

Humanoid-GPT: A GPT-Style Transformer for Humanoid Motion Tracking

Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x the prior data. It reports 92.58% success in simulation at under 1.5 ms.

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

Vision-Language-Action · Allen Institute for AI

MolmoAct2: An Open Action Reasoning Stack for Real Robots

MolmoAct2 is an open vision-language-action stack that reasons in 3D before acting. On real-world DROID it hits 87.1% success, +38.7 points over the runner-up, and its Molmo2-ER brain beats GPT-5 and Gemini Robotics ER.

Vision-Language-Action · Shanghai AI Laboratory

PhysBrain 1.0: Turning Human Video into Physical Priors for Robots

PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control — lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.

Vision-Language-Action · Alibaba Qwen Team

Qwen-VLA: One Model for Manipulation, Navigation, and Trajectories

Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.

Vision-Language-Action · RLWRLD

RLDX-1: A Multi-Stream Vision-Language-Action Model for Dexterous Robots

RLDX-1, from RLWRLD and KAIST, adds motion, memory and tactile streams to a Qwen3-VL backbone. It catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and beats GR00T N1.6 on LIBERO-Plus 86.7% to 72.6%.

Vision-Language-Action · ETH Zurich

Robots Need More than VLA and World Models: Four Missing Interfaces

A position paper from ETH Zurich, Stanford and TU Darmstadt argues scaling VLA and world models is not enough — robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.

Vision-Language-Action · Physical Intelligence

π0 Explained: A Vision-Language-Action Flow Model for Robots

π0 bolts a flow-matching action expert onto a pretrained VLM, emitting ~50Hz action chunks so one policy can fold laundry, bus tables, and assemble boxes across single-arm, dual-arm, and mobile robots.

Vision-Language-Action · Google DeepMind

RT-2 Explained: Vision-Language-Action Models for Robot Control

RT-2 co-fine-tunes a web-pretrained vision-language model on robot trajectories, expresses actions as text tokens, and gets emergent generalization to novel objects, unseen commands, and basic reasoning across 6k trials.