RLDX-1: A Multi-Stream Vision-Language-Action Model for Dexterous Robots

Quick answer

RLDX-1 is a vision-language-action (VLA) model from RLWRLD and KAIST that adds three things most VLAs lack: motion awareness, a long-term memory of recent scene understanding, and a physics stream that predicts tactile and torque signals. On the hardest real-robot test reported — catching fast-moving objects with the ALLEX humanoid hand — it succeeds 87.5% of the time versus 29.2% for the pi0.5 baseline. On the LIBERO-Plus simulation benchmark it scores 86.7%, ahead of GR00T N1.6 at 72.6%.

Why a standard VLA is not enough for dexterous hands

Most VLAs map an image plus a language instruction to a robot action, and that is fine for slow pick-and-place on a parallel gripper. It breaks down on dexterous, contact-rich, time-critical tasks. A single still frame cannot tell you how fast an object is moving toward the hand. A stateless policy forgets what it saw two seconds ago, so it cannot reason over a sequence. And a vision-only model is blind to contact force, which is exactly the signal a multi-fingered hand needs to grasp without crushing or dropping. RLDX-1 is an attempt to bolt all three missing senses onto a modern VLA rather than redesign one from scratch.

How the Multi-Stream Action Transformer works

The backbone is Qwen3-VL 8B, extended with a motion module that computes space-time self-similarity at layer 9 of the vision encoder, giving the model an explicit sense of how the scene is moving rather than a frozen snapshot. The vision-language model emits “cognition tokens” — compressed scene understanding — and a memory module keeps a queue of the three most recent cognition features so the policy can reason over short horizons instead of acting frame-by-frame.

The action head is the Multi-Stream Action Transformer (MSAT): it runs a cognition stream, an action stream, and a physics stream through joint self-attention. The physics stream is the unusual part — it predicts future tactile and torque signals, forcing the model to anticipate contact for grasping and manipulation. Training runs in three stages: pre-training on 1.5M episodes across diverse robot embodiments (100K steps on 64 H200 GPUs), embodiment-specific mid-training (25K steps), then task-specific post-training with optional reinforcement learning. A separate synthetic-data pipeline generates rare manipulation scenarios with video generation plus quality and motion-consistency filtering.

Key results

All numbers are from the technical report; baselines are pi0.5 (written π₀.₅) and GR00T N1.6.

Catching fast-moving objects (ALLEX humanoid): 87.5%, vs 29.2% for pi0.5 — the largest reported gap and the clearest evidence the motion stream pays off.
Object-in-box selection (ALLEX): 91.7%, vs roughly 30% for both pi0.5 and GR00T N1.6.
LIBERO-Plus (simulation): 86.7%, vs 72.6% for GR00T N1.6.
LIBERO: 97.8%, vs 96.7% for GR00T N1.6 — near-saturated, so the gain is small here.
SIMPLER WidowX: 71.9%, vs 46.9% for pi0.5.
SIMPLER Google-VM: 81.5%, vs 72.7% for pi0.5.
Real-world OpenArm, unseen object / unseen task: 54.2% on both, vs 37.5% and 45.8% for pi0.5.
Inference: static-graph conversion gives a 1.63x speedup, cutting per-step latency from 71.2 ms to 43.7 ms on an RTX 5090.

The honest read: on saturated benchmarks like LIBERO the margin is a rounding error, but on dynamic, contact-heavy tasks — catching, object selection, LIBERO-Plus — RLDX-1 opens a real gap, which is the regime the architecture was built for.

Why this matters now

The VLA field has converged on a similar template (a VLM backbone plus an action expert), and pi0.5 and GR00T are the reference points everyone benchmarks against. RLDX-1’s contribution is to argue that the next gains come not from a bigger backbone but from feeding the policy the senses a hand actually uses: motion, memory, and force. The physics stream predicting tactile and torque signals is the most transferable idea here — it is a concrete recipe other labs can copy. The fact that the big wins show up specifically on catching and contact-rich selection, not on static tabletop benchmarks, is the strongest support for that thesis.

Limits and open questions

This is a technical report, not a peer-reviewed paper, and it leads with wins rather than failure modes — there is no dedicated limitations section, which is itself a caveat. The standout numbers come from RLWRLD’s own ALLEX and OpenArm setups, so the catching and selection results are hard to reproduce without that exact hardware. The model is not open: there are no released weights or code in the report, so the comparisons to pi0.5 and GR00T rest on the authors’ own evaluation harness. Real-time claims are tied to an RTX 5090 — a high-end GPU, not an on-robot edge budget. And training cost is non-trivial: 64 H200 GPUs across two pre-training stages puts a from-scratch reproduction out of reach for most academic labs.

FAQ

What is RLDX-1?

RLDX-1 is a vision-language-action model for dexterous and humanoid robots from RLWRLD and KAIST. It pairs a Qwen3-VL 8B backbone with a Multi-Stream Action Transformer that adds motion awareness, short-term memory, and a physics stream predicting tactile and torque signals.

How is RLDX-1 different from pi0.5 and GR00T N1.6?

RLDX-1 adds three senses standard VLAs lack — explicit motion, a memory queue of recent scene features, and force prediction. The payoff is concentrated in dynamic and contact-rich tasks: it catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and scores 86.7% on LIBERO-Plus vs 72.6% for GR00T N1.6.

Is RLDX-1 open source?

The technical report does not release weights or code, so the model is not openly available. The benchmark comparisons are run on the authors’ own evaluation setup, including the ALLEX humanoid hand and OpenArm.

What hardware does RLDX-1 need?

Training used 64 H200 GPUs across pre-training and mid-training. For inference, the report measures 43.7 ms per step on a single RTX 5090 after a 1.63x static-graph speedup, so real-time control assumes a high-end desktop GPU rather than embedded hardware.

One line: feed a VLA the senses a hand actually uses — motion, memory, and force — and the gains show up exactly where it counts, on catching and contact-rich manipulation. Read the original technical report on arXiv.