Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
3DRS: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Authors: Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple benchmarks and MLLMs including visual grounding, captioning, and question answering demonstrate consistent performance gains. Project page: https://visual-ai.github.io/3drs |
| Researcher Affiliation | Collaboration | 1 Visual AI Lab, The University of Hong Kong 2 Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (e.g., Figure 1 and Figure 3a), but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | We will publicly release the code and related instructions in the near future. |
| Open Datasets | Yes | We evaluate our approach on six benchmarks that collectively span key challenges in 3D scene understanding. ScanRefer [5] focuses on localizing objects using free-form language, while Multi3DRefer [59] generalizes this to queries referencing zero, one, or multiple objects, better reflecting real-world ambiguity. Scan2Cap [12] addresses dense captioning by pairing detected objects in 3D scans with natural language descriptions. For question answering, ScanQA [2] tasks models with answering open-ended questions grounded in 3D geometry and semantics, and SQA3D [32] goes further by requiring situated reasoning: agents must interpret their position and context to answer complex queries. All these datasets are sourced from the richly annotated ScanNet [13] corpus, and we follow standard validation and test splits as established in prior work [24, 65, 9, 63]. Besides, VSI-Bench [53] is used to evaluate the performance on visual-based spatial understanding tasks, which are composed of numerical and multiple-choice questions. |
| Dataset Splits | Yes | All these datasets are sourced from the richly annotated ScanNet [13] corpus, and we follow standard validation and test splits as established in prior work [24, 65, 9, 63]. Besides, VSI-Bench [53] is used to evaluate the performance on visual-based spatial understanding tasks, which are composed of numerical and multiple-choice questions. ... Specifically, we follow the model finetuning settings of Video-3D LLM [63] by using the validation splits of ScanRefer, Multi3DRefer, Scan2Cap, and ScanQA, as well as the test split of SQA3D. |
| Hardware Specification | Yes | We use 8 H100 NVIDIA GPUs for all experiments. |
| Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and specific MLLMs (LLaVA-Next-Video 7B, LLaVA-OneVision 7B, Qwen2-VL 7B), but does not provide version numbers for programming languages, libraries, or frameworks like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For both training and inference, we uniformly sample 32 frames per scan to construct multi-view image sets. For evaluating the correspondence score, we use the voxel size of 0.1 for voxelization. All models are optimized using Adam, with a batch size of 16 and a warm-up ratio of 0.03. The learning rates are set to a maximum of 1e-5 for the language model and 2e-6 for the visual backbone during the warm-up period. During training for visual grounding and dense captioning, ground truth object regions are used as candidates, whereas during inference, we follow the procedure of [24, 25, 63] and employ Mask3D [38] to generate object proposals. For LLaVA-Next-Video and LLaVA-OneVision, we finetune all model parameters. For Qwen2-VL, due to GPU memory constraints, we finetune only the projector and the LLM components. |
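
The Experiment Setup row lists concrete hyperparameters, and since the code is not yet released, a reader attempting a reproduction may find it useful to see them gathered in one place. The following is a minimal sketch, not the authors' implementation. The names `SetupConfig`, `uniform_frame_indices`, and `warmup_steps` are hypothetical. The bin-midpoint sampling rule and the ceiling-based warm-up step count are assumptions, because the paper excerpt states only "uniformly sample 32 frames" and "a warm-up ratio of 0.03".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    frames_per_scan: int = 32
    voxel_size: float = 0.1              # used for the correspondence score
    batch_size: int = 16
    warmup_ratio: float = 0.03
    max_lr_language_model: float = 1e-5
    max_lr_visual_backbone: float = 2e-6


def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices (assumed rule: midpoint of each bin)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        # Fewer frames than requested: keep every frame once.
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def warmup_steps(total_steps: int, ratio: float) -> int:
    """Number of warm-up steps for a given ratio (assumed: rounded up)."""
    return math.ceil(total_steps * ratio)


if __name__ == "__main__":
    cfg = SetupConfig()
    print(uniform_frame_indices(64, cfg.frames_per_scan)[:4])
    print(warmup_steps(1000, cfg.warmup_ratio))
```

A frozen dataclass keeps the reported values immutable, so any deviation in a reproduction attempt has to be made explicitly rather than by accidental mutation.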