Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
Authors: Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin F. Yang, Lizhuang Ma, Jieping Ye
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, 2Westlake University, 3Alibaba Cloud Computing, 4Shanghai Jiao Tong University |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are found in the paper. The paper describes methodologies using diagrams and textual explanations, such as Figure 2 'Overview of the data generation pipeline of MSMU' and Figure 5 'The architecture of SD-VLM'. |
| Open Source Code | Yes | Code and models are released at https://github.com/cpystan/SD-VLM. |
| Open Datasets | Yes | Hence, we propose MSMU dataset, namely Massive Spatial Measuring and Understanding dataset shown in Figure 1, a large-scale quantitative spatial reasoning dataset comprising about 25K images and 700K QA pairs (including 10K chain-of-thought samples) from 2K real 3D scenes, with 2.5M numerical annotations. We employ this data generation pipeline to construct VQA pairs from ScanNet [55] and ScanNet++ [54]. We have also evaluated our model on other spatial datasets including Q-Spatial++ [45] and SpatialRGPT-Bench [21]. |
| Dataset Splits | Yes | To address this issue, we have meticulously developed MSMU-Bench, a held-out benchmark from MSMU, designed to rigorously assess the advanced spatial reasoning capabilities of VLMs. As shown in Figure 3 (right), MSMU-Bench contains more quantitative QAs compared to other spatial benchmarks. Comprising about 1K spatial VQA pairs, this benchmark features samples from unseen scans. |
| Hardware Specification | Yes | The model is trained on 8 V100 GPUs, with the batch size of 2 per GPU, using 32 GPU hours. |
| Software Dependencies | No | The paper mentions several models and frameworks such as 'SD-VLM is built upon pretrained LLaVA-1.5-7B', 'The vision encoder is CLIP-ViT/14', and 'The external depth estimation model is Depth-Anything-V2 [48]', but it does not specify software version numbers for key components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | SD-VLM is built upon pretrained LLaVA-1.5-7B. The model is fine-tuned with LoRA [56] on MSMU for one epoch. The model is trained on 8 V100 GPUs, with the batch size of 2 per GPU, using 32 GPU hours. The vision encoder is CLIP-ViT/14. The external depth estimation model is Depth-Anything-V2 [48]. In the training phase, the vision encoder remains frozen. The learning rates for LLM and the projector are 2e-4 and 2e-5, respectively. The threshold for GPT-4 evaluation in MSMU-Bench is 1.25. |
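
The training details quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration record, which also makes two derived quantities explicit. The sketch below is illustrative only and is not taken from the SD-VLM repository: the class name `TrainingSetup`, its field names, and the helper methods are hypothetical. It assumes that "32 GPU hours" is the total summed over all 8 GPUs and that no gradient accumulation is used, neither of which the quoted text confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters as quoted in the table above (field names are hypothetical)."""

    base_model: str = "LLaVA-1.5-7B"
    vision_encoder: str = "CLIP-ViT/14"
    depth_model: str = "Depth-Anything-V2"
    finetune_method: str = "LoRA"
    epochs: int = 1
    num_gpus: int = 8                   # V100 GPUs
    per_gpu_batch_size: int = 2
    total_gpu_hours: float = 32.0       # assumed to be summed over all GPUs
    lr_llm: float = 2e-4
    lr_projector: float = 2e-5
    freeze_vision_encoder: bool = True
    eval_threshold: float = 1.25        # threshold for GPT-4 evaluation on MSMU-Bench

    def global_batch_size(self, grad_accum_steps: int = 1) -> int:
        # Assumption: plain data parallelism; the accumulation factor is not reported.
        return self.num_gpus * self.per_gpu_batch_size * grad_accum_steps

    def wall_clock_hours(self) -> float:
        # Assumption: all GPUs run for the same duration.
        return self.total_gpu_hours / self.num_gpus


setup = TrainingSetup()
print("global batch size:", setup.global_batch_size())
print("approx. wall-clock hours:", setup.wall_clock_hours())
```

Under these assumptions the global batch size is 16 and the run takes roughly 4 hours of wall-clock time. If the authors used gradient accumulation, or meant 32 hours per GPU, both figures change accordingly.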