Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding.
Researcher Affiliation	Academia	1Westlake University, 2Zhejiang University, 3Harbin Institute of Technology, 4The Hong Kong University of Science and Technology (Guangzhou), 5Shanghai Innovation Institute
Pseudocode	No	The paper describes the methodology with figures and text, but does not present any explicitly labeled pseudocode or algorithm blocks for the core methodology. It does provide structured instructions for data generation for LLMs, but these are not 'pseudocode or algorithm blocks' for the proposed SSR method.
Open Source Code	No	We will publicly release the data and code once they have been finalized and prepared.
Open Datasets	Yes	To enable comprehensive evaluation, we introduce a new dataset named SSR-COT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBENCH, a comprehensive multi-task benchmark. ... We will publicly release the data and code once they have been finalized and prepared. The paper also cites many publicly available datasets, such as LLaVA-CoT [51], Visual-CoT [52], VoCoT [53], and SpatialQA [17], that were used for SSR-COT collection.
Dataset Splits	Yes	To enable comprehensive evaluation, we introduce a new dataset named SSR-COT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBENCH, a comprehensive multi-task benchmark. ... SSRBENCH consists of two primary categories, general understanding and spatial understanding, allowing simultaneous evaluation of VLM performance in both general question answering and spatial reasoning tasks. Each category contains three distinct evaluation tasks, with detailed sample sizes provided in Appendix E.
Hardware Specification	Yes	training SSR requires approximately 19 hours for Stage 1 and 48 hours for Stage 2, using a single Nvidia 8-H800 GPU node equipped with 80GB VRAM.
Software Dependencies	No	In this paper, we utilize Mamba [71] as the lower-level efficient language model for reasoning, Qwen2.5 [38] as the LLM for alignment in the first training stage, and Qwen2.5-VL [91] as the VLM supporting multi-modal comprehension in the second training stage. ... Optimizer AdamW [108]. While specific models are mentioned with citations, explicit version numbers for general software dependencies like Python, PyTorch, or CUDA are not provided.
Experiment Setup	Yes	Detailed hyperparameter configurations are provided in Table 9. Table 9: Training hyper-parameters of our proposed SSR (one value applies to both stages unless marked). Optimizer: AdamW [108]; Learning Rate: 0.00002; Numerical Precision: BFloat16; Epoch: 2 (Stage 1), 1 (Stage 2); Global Batch Size: 32 (Stage 1), 32 (Stage 2); Question Length: 256; Rational Length: 1024 (Stage 1), N/A (Stage 2); Answering Length: N/A (Stage 1), 256 (Stage 2); Number of Latent Tokens: 10; Learning Schedule: Cosine Decay; Warm-up Ratio: 0.02.
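
The Experiment Setup row reports a learning rate of 0.00002 with a cosine-decay schedule and a warm-up ratio of 0.02. As a rough illustration of what those three values imply, the sketch below computes the learning rate at a given step. It is not the authors' code: the function name `lr_at_step`, the linear shape of the warm-up, and the decay to exactly zero are assumptions, and the total number of training steps is not given in the source.

```python
import math

# Values quoted from Table 9 of the paper.
PEAK_LR = 2e-5
WARMUP_RATIO = 0.02


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the source): the warm-up is linear,
    and the cosine decay ends at zero.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr (progress 0) down to 0 (progress 1).
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for illustration
    for s in (0, 19, 20, 510, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000-step run, the warm-up would cover the first 20 steps (2% of training) before the cosine decay begins.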