Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

Authors: Zhao Jin, Rong-Cheng Tu, Jingyi Liao, Wenhao Sun, Xiao Luo, Shunyu Liu, Dacheng Tao

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate that SPAZER significantly outperforms previous state-of-the-art zero-shot methods, achieving notable gains of 9.0% and 10.9% in accuracy. Our codes are available at https://github.com/JZ-9962/SPAZER.
Researcher Affiliation	Academia	(1) College of Computing and Data Science, Nanyang Technological University, Singapore; (2) University of California, Los Angeles
Pseudocode	No	The paper describes the methodology in prose and through a framework diagram (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our codes are available at https://github.com/JZ-9962/SPAZER. We will release the full codebase with data preprocessing, model implementation, and evaluation scripts upon publication.
Open Datasets	Yes	We evaluate our method on two widely-used 3D visual grounding benchmarks: ScanRefer [6] and Nr3D [1], which are built upon the ScanNet [9] dataset.
Dataset Splits	Yes	To enable fair comparison and reduce expenditure, our main experiments are conducted on the same ScanRefer and Nr3D subsets as [48]. We follow previous work VLM-Grounder [48] to evaluate our agent on the subset (250 selected samples) of each dataset.
Hardware Specification	Yes	The experiments involving Qwen2-VL-72B and Qwen2.5-VL-72B are conducted on multiple NVIDIA H100 GPUs.
Software Dependencies	Yes	Our agent adopts GPT-4o as the default VLM... The default VLM of our agent is GPT-4o (gpt-4o-2024-08-06). The experiments involving Qwen2-VL-72B and Qwen2.5-VL-72B are conducted on multiple NVIDIA H100 GPUs.
Experiment Setup	Yes	Our agent adopts GPT-4o as the default VLM. The number of views n is set to 4, and the Top-k parameter is set to k = 4. For the ScanRefer dataset, we follow prior works [54, 22] and use a pre-trained model [34] to obtain the 3D bounding boxes. All ablation studies are conducted on the same subset of Nr3D as [48]. The temperature is set to 0.2 to improve the reproducibility of the results.
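
For anyone attempting a re-run, the settings quoted in the Experiment Setup and Dataset Splits rows can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name, field names, and the helper `as_request_kwargs` are hypothetical, and the `model`/`temperature` keyword names are an assumption about a generic chat-style client rather than anything stated in the paper or its repository. Only the values themselves come from the table above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the table above; all field names are hypothetical."""

    vlm: str = "gpt-4o-2024-08-06"  # default VLM snapshot (Software Dependencies row)
    num_views: int = 4              # number of views n (Experiment Setup row)
    top_k: int = 4                  # Top-k parameter k (Experiment Setup row)
    temperature: float = 0.2        # sampling temperature (Experiment Setup row)
    eval_subset_size: int = 250     # samples per dataset, following [48]


def as_request_kwargs(setup: ReportedSetup) -> dict:
    """Map the setup to keyword arguments for a generic chat-style client.

    The key names here are an assumption, not taken from the paper.
    """
    return {"model": setup.vlm, "temperature": setup.temperature}


if __name__ == "__main__":
    setup = ReportedSetup()
    print(asdict(setup))
    print(as_request_kwargs(setup))
```

Freezing the dataclass keeps the reported values from being changed by accident during a run; any deviation from them would then have to be made explicitly, for example with `dataclasses.replace`.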