Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai WANG, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments across five representative 3D scene-language datasets, including ScanRefer [4], Multi3DRefer [75], Scan2Cap [12], ScanQA [2], and SQA3D [38]. Our model consistently enhances state-of-the-art performance across all these datasets without fine-tuning on specific tasks. Notably, it surpasses previous methods by 3.7% (Acc@0.5) on ScanRefer, 14.0% (F1@0.5) on Multi3DRefer, 8.7% (CIDEr@0.5) on Scan2Cap, and 7.7% (CIDEr) on ScanQA. |
| Researcher Affiliation | Collaboration | Haifeng Huang (1,2), Yilun Chen (2), Zehan Wang (1), Rongjie Huang (1), Runsen Xu (2), Tai Wang (2), Luping Liu (1), Xize Cheng (1), Yang Zhao (3), Jiangmiao Pang (2), Zhou Zhao (1,2). Affiliations: (1) Zhejiang University; (2) Shanghai AI Laboratory; (3) Bytedance Inc. |
| Pseudocode | No | The paper describes its model architecture and training strategy in text and diagrams but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code has been released at https://github.com/ZzZZCHS/Chat-Scene. |
| Open Datasets | Yes | We conducted experiments on five benchmarks: ScanRefer [4] for single-object visual grounding, Multi3DRefer [75] for multi-object visual grounding, Scan2Cap [12] for dense captioning, and both ScanQA [2] and SQA3D [38] for visual question answering. These benchmarks are based on the ScanNet dataset [16], which comprises richly annotated RGB-D scans of real-world indoor scenes, including both 2D and 3D data across 1,513 scans. |
| Dataset Splits | Yes | All benchmarks adhere to the same train/validation/test splits, facilitating joint training and evaluation. |
| Hardware Specification | Yes | The entire training process takes approximately 8 hours on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Vicuna-7B-v1.5 and LoRA, and pre-trained encoders like Uni3D [78] and DINOv2 [40], but does not list specific versions for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The base learning rate is set to 5e-6 with a cosine annealing schedule, and the batch size is 32. The training takes 2 epochs and the total training step is 3200. |
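
The reported setup (base learning rate 5e-6, cosine annealing, batch size 32, 2 epochs, 3200 total steps) can be illustrated with a minimal sketch. This is not the authors' training code. The function name `cosine_lr`, the absence of warmup, and a minimum learning rate of 0 are assumptions, as is the reading of 32 as the global batch size. The per-epoch sample count follows from that reading and is not a figure stated in the paper.

```python
import math

# Values quoted in the Experiment Setup row.
BASE_LR = 5e-6
BATCH_SIZE = 32
TOTAL_STEPS = 3200
EPOCHS = 2


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine-annealed learning rate at a given step.

    Assumptions (not from the paper): no warmup phase and min_lr = 0.
    """
    # Clamp so out-of-range steps do not restart the cosine cycle.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Derived quantities, assuming 32 is the global batch size.
steps_per_epoch = TOTAL_STEPS // EPOCHS
samples_per_epoch = steps_per_epoch * BATCH_SIZE

if __name__ == "__main__":
    for s in (0, 800, 1600, 2400, 3200):
        print(f"step {s:4d}: lr = {cosine_lr(s):.3e}")
    print("steps per epoch:", steps_per_epoch)
    print("samples per epoch (assumed global batch):", samples_per_epoch)
```

Under these assumptions the learning rate starts at the base value, reaches half of it at the midpoint of training, and decays to zero at step 3200.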