AI Agents · Multimodal Models · Vision Foundation Models
SpatialClaw: Why VLM Spatial Agents Need a Python Workspace
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
Quick answer
SpatialClaw argues that spatial reasoning agents fail partly because their action interface is too rigid. Instead of one-shot code or predefined JSON tool calls, it gives a VLM-backed agent a persistent Python kernel loaded with frames, perception tools, geometry utilities, variables, plots, and runtime feedback. Across 20 spatial reasoning benchmarks, SpatialClaw reports 59.9% average accuracy, compared with 48.7% for the recent spatial-agent baseline in the same table.
The idea: code as the action interface
Tool-augmented spatial agents usually expose perception modules through structured calls. That is clean, but it limits composition. If a task requires segmenting an object, combining it with a depth map, fitting a plane, checking a trajectory, and then revising the plan after seeing an odd mask, a fixed tool schema becomes a bottleneck.
SpatialClaw uses code differently. The agent writes one Python cell at a time. Each cell can call perception primitives, inspect intermediate outputs, define helper variables, plot results, handle errors, and condition the next cell on what happened. The kernel persists across steps, so the workspace becomes memory for spatial evidence.
This is a harness paper more than a model paper. The paper does not claim that the base VLM suddenly understands 3D. It claims that an iterative code workspace lets the VLM turn visual evidence into inspectable computations.
Key results
- Overall accuracy: SpatialClaw reaches 59.9% average accuracy across 20 benchmarks using the Gemma 4-31B backbone.
- Baseline gap: the recent spatial-agent baseline in the table is 48.7%, so the headline gap is +11.2 points.
- Interface comparison: with the same tools, no-tool gets 53.4, single-pass code 55.2, structured tool-call 56.7, and SpatialClaw 59.9.
- Strong categories: gains are largest in multi-view and video or 4D spatial reasoning, where intermediate geometric computation matters.
- MindCube and DSI-Bench: SpatialClaw reaches 72.8 on MindCube, versus 62.4 for structured tool calls and 52.9 for SpaceTools; on DSI-Bench it reaches 62.9, versus 58.4 structured and 43.0 SpaceTools.
- Tool ablation: removing utility functions barely changes the sampled average, 56.9 to 56.4, but removing perception tools drops it to 51.4.
Why persistent execution matters
Single-pass code has expressive power, but it commits before seeing evidence. Structured tool calls can expose perception models, but they require the interface designer to anticipate the composition. SpatialClaw’s persistent Python workspace avoids both problems: the agent can run a cell, see that a mask is wrong, change the operation, and carry variables forward.
That explains why the gains concentrate in camera motion, multi-view reasoning, relative direction, and temporal reasoning. Those tasks often need several small spatial operations rather than one semantic answer. The paper’s pairwise category analysis says SpatialClaw beats single-pass code and structured tool calls in 11 of 13 meta-categories.
What builders should copy
The practical takeaway is not “give agents every possible tool.” SpatialClaw works because it gives the model a programmable workbench with state. If you are building a visual agent, a small set of reliable perception primitives plus a persistent code environment may beat a large menu of narrow tool endpoints.
The second lesson is to evaluate the harness separately from the model. SpatialClaw tests multiple backbones and keeps the prompts, tools, and hyperparameters aligned across benchmarks. That makes the action-interface claim easier to believe.
Limits and open questions
SpatialClaw is training-free, but it is not cost-free. Multi-step code execution, perception calls, plots, and retries add latency and failure modes. A real robot or AR system would need stronger safety controls than a benchmark workspace.
The benchmark set is broad, yet many tasks still reduce to perception-plus-geometry questions. The paper does not prove that a VLM agent can build a physically valid plan in the world, only that the interface improves benchmark spatial reasoning under a controlled tool stack.
FAQ
What is SpatialClaw?
SpatialClaw is a training-free spatial reasoning agent from NVIDIA. It gives a VLM a persistent Python kernel with perception and geometry primitives, so the model can iteratively inspect and compute over visual evidence.
How much does SpatialClaw improve spatial reasoning?
On 20 spatial reasoning benchmarks, SpatialClaw reports 59.9% average accuracy with Gemma 4-31B, versus 48.7% for the recent spatial-agent baseline and 56.7% for a structured tool-call interface.
How much does SpatialClaw gain on MindCube and DSI-Bench?
On MindCube, SpatialClaw reports 72.8, compared with 62.4 for structured tool calls and 52.9 for SpaceTools. On DSI-Bench, it reports 62.9, compared with 58.4 structured and 43.0 SpaceTools.
Why does SpatialClaw use Python instead of tool calls?
Python lets the agent compose operations after seeing intermediate evidence. Structured tool calls are easier to validate, but they restrict test-time compositions that were not anticipated by the interface designer.
Does SpatialClaw prove VLMs understand 3D?
No. It shows that a better action interface helps VLM-backed agents use external spatial evidence. The improvement should be attributed to the harness plus tools, not only to the base model.
One line: SpatialClaw’s useful claim is that spatial agents need an inspectable workspace, not just more named tools. Read the original paper on arXiv.