Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Authors: Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. We conduct extensive evaluations on the VSI-Bench [18], ScanQA [38], and SQA3D [39] benchmarks and demonstrate that the proposed Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks.
Researcher Affiliation | Academia | Diankun Wu (Tsinghua University), Fangfu Liu (Tsinghua University), Yi-Hsin Hung (Tsinghua University), Yueqi Duan (Tsinghua University)
Pseudocode | Yes | Algorithm 1 Greedy Maximum Coverage Sampling
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will release the data and code upon acceptance.
Open Datasets | Yes | We conduct extensive evaluations on the VSI-Bench [18], ScanQA [38], and SQA3D [39] benchmarks and demonstrate that the proposed Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. We construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. We first construct a visual-based 3D spatial question-answering dataset. The dataset has approximately 120k QA pairs and is constructed from three sources: the training set of ScanQA [38], SQA3D [39], as well as additional self-created spatial QA data.
Dataset Splits | Yes | We use the val set of ScanQA and test set of SQA3D for evaluation following common practice [22, 47, 25]. For the self-created QA data, we follow the data processing pipeline proposed in VSI-Bench [18]. Specifically, we first convert ScanNet scenes into continuous video clips at 24 FPS and 640 × 480 resolution. Finally, the self-created data contains approximately 70k QA pairs in total. During inference, we set N_m = 128 and N_k = 16 for space-aware frame sampling.
Hardware Specification | Yes | We conduct all experiments on Intel(R) Xeon(R) Gold 6430 platform with 80G NVIDIA A800 GPUs.
Software Dependencies | No | Spatial-MLLM is built on Qwen2.5-VL [14] and VGGT [32] and has approximately 4.9B parameters in total. We use the visual encoder of Qwen2.5-VL [14] to initialize E_2D, and the LLM backbone of it to initialize f_θ. We then use the feature backbone of VGGT [32] to initialize E_spatial.
Experiment Setup | Yes | During training, we use 640 × 480 resolution and limit video frames to 16. In the SFT stage, we train the model using Adam optimizer [65] for one epoch. We set the global batch size to 16 and use a linear learning-rate schedule, with a peak value of 10^-5. In the RL stage, we perform 8 rollouts per question and set the default sampling temperature to 1. The KL divergence coefficient, β, is set to 0.04. Due to computational resource limitations, we train the model for 1,000 steps with a learning rate of 10^-6. During inference, we set N_m = 128 and N_k = 16 for space-aware frame sampling. Since spatial reasoning requires a certain level of determinism, we set the temperature to 0.1 and the top-p to 0.001.
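
Illustrative sketch: The table names "Algorithm 1 Greedy Maximum Coverage Sampling" and reports that space-aware frame sampling keeps 16 frames out of 128 at inference, but it does not reproduce the algorithm. The Python below is a minimal sketch of the generic greedy maximum coverage heuristic, not the authors' implementation. It assumes each candidate frame is summarised by a set of covered items (for example voxel IDs); the function name, the set-based representation, the tie-breaking rule, and the early stop are all hypothetical choices made for illustration.

```python
from typing import Hashable, List, Sequence, Set


def greedy_max_coverage(coverage: Sequence[Set[Hashable]], k: int) -> List[int]:
    """Select up to k frame indices by the generic greedy max-coverage rule.

    coverage[i] is the (assumed) set of items seen by candidate frame i.
    At each step the frame adding the most not-yet-covered items is taken.
    """
    covered: Set[Hashable] = set()
    remaining = set(range(len(coverage)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        # Ties are resolved in favour of the lowest frame index (an assumption).
        best = max(sorted(remaining), key=lambda i: len(coverage[i] - covered))
        if not coverage[best] - covered:
            break  # no remaining frame adds anything new; stop early
        chosen.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return sorted(chosen)  # return indices in temporal order


# Toy example: four candidate frames, keep two.
frames = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 2}]
print(greedy_max_coverage(frames, 2))  # → [0, 2]
```

With the settings quoted in the table, the call would take 128 coverage sets and k = 16. Because this sketch stops once nothing new can be covered, it may return fewer than k indices; a caller that needs exactly k frames would have to pad the result, for instance with uniformly spaced frames.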