Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
3D Question Answering via only 2D Vision-Language Models
Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (of the resource-intensive 3D LVLMs) for addressing 3D tasks. |
| Researcher Affiliation | Academia | (1) Nanyang Technological University, Singapore; (2) Singapore Management University, Singapore; (3) National University of Singapore, Singapore; (4) Nanjing University of Science & Technology, Nanjing, China. Correspondence to: Qianru Sun <EMAIL>. |
| Pseudocode | No | The paper describes the cdViews framework, viewSelector, viewNMS, and viewAnnotator in detailed prose, but does not present any of these as structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/fereenwong/cdViews. |
| Open Datasets | Yes | We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative (of the resource-intensive 3D LVLMs) for addressing 3D tasks. |
| Dataset Splits | Yes | ScanQA contains over 41K question-answer annotations across 800 indoor 3D scenes, which are divided into train, val, and test sets (with or without objects). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | We utilize a recent state-of-the-art LVLM, i.e., LLAVA-OV-7B (Li et al., 2024a), as the 2D LVLM for all experiments, including viewAnnotator and 3D-QA. The model remains frozen throughout all experiments. Analysis on more LVLM backbones is shown in Appendix C. |
| Experiment Setup | Yes | Training of the viewSelector is conducted with a learning rate of 5 × 10^-5 and a batch size of 8. Each training iteration samples 5 positive and 5 negative views per instance generated by viewAnnotator. Here the number of views, e.g., k=9 for cdViews, is selected on the validation set (Figure 4). |
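
The reported experiment setup can be collected into a small configuration object, which may help anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the names `SelectorTrainConfig` and `sample_views` are hypothetical, and the sampling strategy (uniform, without replacement, capped at the number of available views) is an assumption, since the excerpt states only the counts.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorTrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    learning_rate: float = 5e-5   # "learning rate of 5 × 10^-5"
    batch_size: int = 8           # "batch size of 8"
    num_positive: int = 5         # positive views sampled per instance
    num_negative: int = 5         # negative views sampled per instance
    num_views_k: int = 9          # k = 9, selected on the validation set


def sample_views(positives, negatives, cfg, rng):
    """Draw positive and negative views for one training instance.

    Assumption: uniform sampling without replacement; if fewer views are
    available than requested, all available views are returned.
    """
    pos = rng.sample(positives, min(cfg.num_positive, len(positives)))
    neg = rng.sample(negatives, min(cfg.num_negative, len(negatives)))
    return pos, neg


if __name__ == "__main__":
    cfg = SelectorTrainConfig()
    rng = random.Random(0)  # fixed seed so the draw is repeatable
    # Placeholder view identifiers; real labels would come from the annotation step.
    pos, neg = sample_views(list(range(20)), list(range(100, 120)), cfg, rng)
    print(len(pos), len(neg))
```

Note that hardware and software versions are not reported, so a configuration like this covers only the hyperparameters the table actually lists.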