Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform comprehensive experiments across five representative 3D scene-language datasets, including ScanRefer [4], Multi3DRefer [75], Scan2Cap [12], ScanQA [2], and SQA3D [38]. Our model consistently enhances state-of-the-art performance across all these datasets without fine-tuning on specific tasks. Notably, it surpasses previous methods by 3.7% (Acc@0.5) on ScanRefer, 14.0% (F1@0.5) on Multi3DRefer, 8.7% (CIDEr@0.5) on Scan2Cap, and 7.7% (CIDEr) on ScanQA. (See the grounding-metric sketch after this table.)
Researcher Affiliation | Collaboration | Haifeng Huang (1,2), Yilun Chen (2), Zehan Wang (1), Rongjie Huang (1), Runsen Xu (2), Tai Wang (2), Luping Liu (1), Xize Cheng (1), Yang Zhao (3), Jiangmiao Pang (2), Zhou Zhao (1,2); affiliations: (1) Zhejiang University, (2) Shanghai AI Laboratory, (3) Bytedance Inc.
Pseudocode | No | The paper describes its model architecture and training strategy in text and diagrams but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code has been released at https://github.com/ZzZZCHS/Chat-Scene.
Open Datasets | Yes | We conducted experiments on five benchmarks: ScanRefer [4] for single-object visual grounding, Multi3DRefer [75] for multi-object visual grounding, Scan2Cap [12] for dense captioning, and both ScanQA [2] and SQA3D [38] for visual question answering. These benchmarks are based on the ScanNet dataset [16], which comprises richly annotated RGB-D scans of real-world indoor scenes, including both 2D and 3D data across 1,513 scans.
Dataset Splits | Yes | All benchmarks adhere to the same train/validation/test splits, facilitating joint training and evaluation.
Hardware Specification | Yes | The entire training process takes approximately 8 hours on 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using specific models like Vicuna-7B-v1.5 and LoRA, and pre-trained encoders like Uni3D [78] and DINOv2 [40], but does not list specific versions for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The base learning rate is set to 5e-6 with a cosine annealing schedule, and the batch size is 32. Training runs for 2 epochs, for a total of 3,200 steps. (See the schedule sketch after this table.)
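
Several of the reported gains (Acc@0.5, F1@0.5, CIDEr@0.5) are thresholded on 3D box IoU. For readers checking the evaluation, here is a minimal sketch of how an Acc@0.5-style grounding score is typically computed over axis-aligned 3D boxes; `box_iou_3d` and `acc_at_iou` are illustrative helper names, not functions from the Chat-Scene codebase.

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, dtype=float), np.asarray(box_b, dtype=float)
    # Per-axis overlap, clipped at zero when the boxes do not intersect.
    overlap = np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0.0, None)
    inter = overlap.prod()
    union = (a[3:] - a[:3]).prod() + (b[3:] - b[:3]).prod() - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predicted boxes whose IoU with ground truth meets the threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)

# Example: one exact match and one clear miss -> Acc@0.5 of 0.5.
preds = [(0, 0, 0, 1, 1, 1), (0, 0, 0, 1, 1, 1)]
gts   = [(0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6)]
print(acc_at_iou(preds, gts))  # 0.5
```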
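
The experiment-setup row reports only a few hyperparameters. Below is a minimal PyTorch sketch of that optimization schedule, assuming `torch.optim.AdamW` and `torch.optim.lr_scheduler.CosineAnnealingLR` (the paper does not name the optimizer); the `nn.Linear` stand-in is purely hypothetical, since the real model pairs a LoRA-tuned Vicuna-7B with frozen Uni3D and DINOv2 encoders.

```python
import torch

# Hypothetical stand-in for the trainable parameters; the actual Chat-Scene
# model is not reproduced here.
model = torch.nn.Linear(4096, 4096)

BATCH_SIZE = 32      # reported batch size
TOTAL_STEPS = 3200   # 2 epochs, as reported

# Base learning rate 5e-6, decayed along a cosine annealing schedule
# over the full 3,200 training steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

for step in range(TOTAL_STEPS):
    optimizer.zero_grad()
    # ... forward pass on a batch of BATCH_SIZE scene-language samples,
    # loss computation, and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()  # one scheduler step per optimizer step
```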