Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Authors: Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, Alan L. Yuille

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that our SpatialReasoner with explicit 3D representations can significantly enhance 3D spatial reasoning abilities of LVLMs and generalize to novel question types. Besides improving spatial reasoning performance on a variety of benchmarks, we experiment on various LVLMs fine-tuned with different combinations of data and training methods to study the key factors toward improved 3D spatial reasoning. Our empirical results lead to the following insights:...
Researcher Affiliation	Collaboration	Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso M de Melo, Jianwen Xie, Alan Yuille. Johns Hopkins University, DEVCOM Army Research Laboratory, Lambda Inc
Pseudocode	No	The paper describes the methodology and training strategies in detail across sections like '3 SpatialReasoner' and its subsections (e.g., '3.2 Learning Explicit 3D Representations', '3.3 Training Strategies'), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	All code, data, and models will be available on our project page to support reproducibility and benefit the research community. 1. Codebase for our full 3D-aware data generation pipeline. 2. Codebase for our SFT and RL finetuning. 3. Synthesized 3D-aware training data. 4. Weights of our SpatialReasoner and SpatialReasoner-SFT.
Open Datasets	Yes	We evaluate spatial reasoning abilities of various models on three spatial reasoning benchmarks. 3DSRBench [31] is a comprehensive 3D spatial reasoning benchmark... CVBench [45] is a vision-centric benchmark... GQA [19] is a widely adopted benchmark... Our process begins with generating 3D pseudo-annotations... on unlabeled images from the Open Images dataset [22].
Dataset Splits	Yes	Starting from the Qwen2.5-VL-7B [4] base model, we first apply SFT with 24k curated SR-CoT data alongside 24k randomly sampled LLaVA [27] data, resulting in SpatialReasoner-SFT. Next, we further train SpatialReasoner-SFT with RL using 1.2k SR-QA examples, leading to our final SpatialReasoner.
Hardware Specification	Yes	We conduct all training experiments using 4 NVIDIA H100 80GB HBM3 GPUs. ...using 1 GPU with vLLM [23] for efficient inference acceleration.
Software Dependencies	Yes	Starting from the Qwen2.5-VL-7B [4] base model, we first apply SFT... using 1 GPU with vLLM [23] for efficient inference acceleration.
Experiment Setup	Yes	For SFT, we train the model for 10 epochs (approximately 20K steps with a batch size of 6) on the combined 24k SR-CoT and 24k LLaVA datasets. For RL training, we train for 100 epochs (approximately 13K steps with a batch size of 12)... We set the learning rate to 5e-6 for SFT and 5e-7 for RL, both following a cosine learning rate scheduler with a warm-up ratio of 0.1. In the KL divergence ablation study, we set the KL penalty weight to 0.04.
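
The Experiment Setup row quotes a cosine learning rate scheduler with a warm-up ratio of 0.1, peak rates of 5e-6 (SFT) and 5e-7 (RL), and roughly 20K and 13K steps. The sketch below illustrates one common reading of that description. The linear warm-up shape, the decay to zero, and the function name `lr_at_step` are assumptions made for illustration; the quoted text does not specify them, and this is not the authors' training code.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at a given step.

    Assumed shape (not confirmed by the paper): linear warm-up from 0 to
    peak_lr over the first warmup_ratio of training, then cosine decay to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up phase.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay phase: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Step counts and peak rates as quoted in the table (step counts are approximate).
SFT = {"total_steps": 20_000, "peak_lr": 5e-6}
RL = {"total_steps": 13_000, "peak_lr": 5e-7}

if __name__ == "__main__":
    for name, cfg in (("SFT", SFT), ("RL", RL)):
        for step in (0, cfg["total_steps"] // 10, cfg["total_steps"] // 2, cfg["total_steps"]):
            print(name, step, lr_at_step(step, **cfg))
```

Under these assumptions the SFT rate peaks at step 2,000 and the RL rate at step 1,300, after which both decay smoothly toward zero.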