Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. |
| Researcher Affiliation | Academia | ¹Westlake University, ²Zhejiang University, ³Harbin Institute of Technology, ⁴The Hong Kong University of Science and Technology (Guangzhou), ⁵Shanghai Innovation Institute |
| Pseudocode | No | The paper describes the methodology with figures and text, but does not present any explicitly labeled pseudocode or algorithm blocks for the core methodology. It does provide structured instructions for data generation for LLMs, but these are not 'pseudocode or algorithm blocks' for the proposed SSR method. |
| Open Source Code | No | We will publicly release the data and code once they have been finalized and prepared. |
| Open Datasets | Yes | To enable comprehensive evaluation, we introduce a new dataset named SSR-COT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBENCH, a comprehensive multi-task benchmark. ... We will publicly release the data and code once they have been finalized and prepared. The paper also cites many publicly available datasets, such as LLaVA-CoT [51], Visual-CoT [52], VoCoT [53], and SpatialQA [17], that were used for SSR-COT collection. |
| Dataset Splits | Yes | To enable comprehensive evaluation, we introduce a new dataset named SSR-COT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBENCH, a comprehensive multi-task benchmark. ... SSRBENCH consists of two primary categories, general understanding and spatial understanding, allowing simultaneous evaluation of VLM performance in both general question answering and spatial reasoning tasks. Each category contains three distinct evaluation tasks, with detailed sample sizes provided in Appendix E. |
| Hardware Specification | Yes | training SSR requires approximately 19 hours for Stage 1 and 48 hours for Stage 2, using a single Nvidia 8-H800 GPU node equipped with 80GB VRAM. |
| Software Dependencies | No | In this paper, we utilize Mamba [71] as the lower-level efficient language model for reasoning, Qwen2.5 [38] as the LLM for alignment in the first training stage, and Qwen2.5-VL [91] as the VLM supporting multi-modal comprehension in the second training stage. ... Optimizer AdamW [108]. While specific models are mentioned with citations, explicit version numbers for general software dependencies like Python, PyTorch, or CUDA are not provided. |
| Experiment Setup | Yes | Detailed hyperparameter configurations are provided in Table 9. Table 9: Training hyper-parameters of our proposed SSR (Stage 1 / Stage 2; a single value applies to both stages). Optimizer: AdamW [108]; Learning Rate: 0.00002; Numerical Precision: BFloat16; Epoch: 2 / 1; Global Batch Size: 32 / 32; Question Length: 256; Rational Length: 1024 / N/A; Answering Length: N/A / 256; Number of Latent Tokens: 10; Learning Schedule: Cosine Decay; Warm-up Ratio: 0.02. |
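
As a rough illustration of the schedule named in the Experiment Setup row, the sketch below computes a learning rate from the reported peak value (0.00002), cosine decay, and warm-up ratio (0.02). The source does not give the shape of the warm-up, the final learning rate, or the number of training steps. The linear warm-up, the decay to zero, the function name `lr_at_step`, and the step count in the usage note are all assumptions made for illustration, not the authors' implementation.

```python
import math

PEAK_LR = 2e-5        # "Learning Rate 0.00002" from Table 9
WARMUP_RATIO = 0.02   # "Warm-up Ratio 0.02" from Table 9


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumptions not stated in the source: warm-up is linear and starts
    near zero, and the cosine decay ends at zero on the final step.
    """
    warmup_steps = max(1, round(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        # Linear warm-up: reaches PEAK_LR on the last warm-up step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a hypothetical run of 1000 steps, `lr_at_step(19, 1000)` returns the peak value at the end of the 20-step warm-up, after which the rate falls along a half cosine.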