Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation
Authors: Jiaxin Huang, Runnan Chen, Ziwen Li, Zhengqing Gao, Xiao He, Yandong Guo, Mingming Gong, Tongliang Liu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations of various challenging indoor scene benchmarks demonstrate that, even without labeled 3D training data, MLLM-For3D outperforms existing 3D reasoning segmentation methods, effectively interpreting user intent, understanding 3D scenes, and reasoning about spatial relationships. ... 4 Experiments In this section, we present the experimental results for three challenging tasks, focusing on 3D reasoning segmentation, intention grounding. ... 4.3 Ablation Studies & Analysis |
| Researcher Affiliation | Collaboration | Jiaxin Huang1 Runnan Chen2, Ziwen Li1 Zhengqing Gao1 Xiao He4 Yandong Guo4 Mingming Gong1,3 Tongliang Liu1,2, 1MBZUAI 2The University of Sydney 3The University of Melbourne 4AI2Robotic |
| Pseudocode | No | The paper describes the methodology in prose and figures (Figure 2), but does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at: https://github.com/tmllab/2025_NeurIPS_MLLM-For3D |
| Open Datasets | Yes | We evaluate the performance of 2D reasoning segmentation models, LISA [36] on the ScanNet++ dataset [64]... For the 3D reasoning segmentation task, we adopt Reason3D [23] (derived from ScanNet [12] and Matterport3D [3]) and Instruct3D [19] (derived from ScanNet++ v1 [64]) as benchmarks... Finally, we evaluate grounding on two 3D spatial-reasoning datasets built on ScanNet scenes: 3D-IG (Intent3D) [32] and VG w/o ON [59]. |
| Dataset Splits | Yes | The Reason3D dataset [23] provides query-conditioned object masks on ScanNet V2 [12] and Matterport3D [3]. We follow the official splits and statistics: Matterport3D contributes 934 training and 837 validation samples, while ScanNet V2 contributes 405 training and 308 validation samples. ... The filtered Instruct3D split contains 136 training scenes and 45 validation scenes, yielding 1,034 and 321 query-answer (QA) pairs, respectively. |
| Hardware Specification | Yes | Training is performed on four NVIDIA A100 GPUs (40 GB each). |
| Software Dependencies | No | Our 3D segmentation network is implemented using MinkowskiNet14 as the backbone, built on the PyTorch framework. ... We adopt the Minkowski Engine's 3D U-Net backbone ("MinkowskiNet14"). The text mentions software components like PyTorch and Minkowski Engine/MinkowskiNet14 but does not provide specific version numbers for PyTorch or the Minkowski Engine library itself, only the specific network architecture MinkowskiNet14. |
| Experiment Setup | Yes | For optimization, we employ stochastic gradient descent (SGD) with momentum set to 0.9 and a weight decay of 1×10⁻⁴. Data augmentations such as random rotation around the upright axis, random flips on point clouds, and random horizontal flips and resized crops on images were consistently applied to enhance model generalization. ... Table 5: Training configurations across different datasets. Columns: Dataset, Backbone, Batch Size, LR, Epochs, GPUs, Voxel Size, Max Sweeps. ScanNet V2: MinkowskiNet14, 8, 0.10, 40, 4, 0.05 m, 1. Matterport3D: MinkowskiNet14, 4, 0.10, 40, 4, 0.05 m, 1. ScanNet++ (Instruct3D): MinkowskiNet14, 4, 0.01, 40, 4, 0.05 m, 1. |
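
For readers who want to reuse the reported setup, the values quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the `TrainConfig` class, the `CONFIGS` dictionary keys, and the `sgd_kwargs` helper are hypothetical names introduced here for illustration, and only the numbers come from the quoted text and Table 5. Whether the batch size is per GPU or global is not stated in the excerpt, so the sketch records it as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for one Table 5 row plus the shared SGD settings."""
    dataset: str
    backbone: str
    batch_size: int       # as listed in Table 5 (per-GPU vs. global is not stated)
    lr: float
    epochs: int
    gpus: int
    voxel_size_m: float   # voxel size in metres
    max_sweeps: int
    momentum: float = 0.9        # SGD momentum quoted in the paper
    weight_decay: float = 1e-4   # weight decay quoted in the paper


# Values transcribed from the Table 5 excerpt; the dictionary keys are illustrative.
CONFIGS = {
    "scannet_v2": TrainConfig("ScanNet V2", "MinkowskiNet14", 8, 0.10, 40, 4, 0.05, 1),
    "matterport3d": TrainConfig("Matterport3D", "MinkowskiNet14", 4, 0.10, 40, 4, 0.05, 1),
    "instruct3d": TrainConfig("ScanNet++ (Instruct3D)", "MinkowskiNet14", 4, 0.01, 40, 4, 0.05, 1),
}


def sgd_kwargs(cfg: TrainConfig) -> dict:
    """Return the optimizer settings as keyword arguments (lr, momentum, weight_decay)."""
    return {"lr": cfg.lr, "momentum": cfg.momentum, "weight_decay": cfg.weight_decay}
```

The dictionary returned by `sgd_kwargs` uses the same keyword names as `torch.optim.SGD`, so, assuming a PyTorch model, it could be passed as `torch.optim.SGD(model.parameters(), **sgd_kwargs(cfg))`. Details the excerpt does not report, such as the learning-rate schedule, are deliberately left out.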