Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments across five representative 3D scene-language datasets, including ScanRefer [4], Multi3DRefer [75], Scan2Cap [12], ScanQA [2], and SQA3D [38]. Our model consistently enhances state-of-the-art performance across all these datasets without fine-tuning on specific tasks. Notably, it surpasses previous methods by 3.7% (Acc@0.5) on ScanRefer, 14.0% (F1@0.5) on Multi3DRefer, 8.7% (CIDEr@0.5) on Scan2Cap, and 7.7% (CIDEr) on ScanQA. |
| Researcher Affiliation | Collaboration | Haifeng Huang (1,2), Yilun Chen (2), Zehan Wang (1), Rongjie Huang (1), Runsen Xu (2), Tai Wang (2), Luping Liu (1), Xize Cheng (1), Yang Zhao (3), Jiangmiao Pang (2), Zhou Zhao (1,2); 1: Zhejiang University, 2: Shanghai AI Laboratory, 3: Bytedance Inc. |
| Pseudocode | No | The paper describes its model architecture and training strategy in text and diagrams but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code has been released at https://github.com/ZzZZCHS/Chat-Scene. |
| Open Datasets | Yes | We conducted experiments on five benchmarks: ScanRefer [4] for single-object visual grounding, Multi3DRefer [75] for multi-object visual grounding, Scan2Cap [12] for dense captioning, and both ScanQA [2] and SQA3D [38] for visual question answering. These benchmarks are based on the ScanNet dataset [16], which comprises richly annotated RGB-D scans of real-world indoor scenes, including both 2D and 3D data across 1,513 scans. |
| Dataset Splits | Yes | All benchmarks adhere to the same train/validation/test splits, facilitating joint training and evaluation. |
| Hardware Specification | Yes | The entire training process takes approximately 8 hours on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Vicuna-7B-v1.5 and LoRA, and pre-trained encoders like Uni3D [78] and DINOv2 [40], but does not list specific versions for general software dependencies such as Python, PyTorch, or CUDA (see the first sketch below the table). |
| Experiment Setup | Yes | The base learning rate is set to 5e-6 with a cosine annealing schedule, and the batch size is 32. Training runs for 2 epochs, for a total of 3,200 steps (see the training-loop sketch below the table). |
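
To make the software-dependency row concrete, here is a minimal sketch (not the paper's released code) of loading Vicuna-7B-v1.5 with a LoRA adapter via Hugging Face `transformers` and `peft`. Only the backbone name and the use of LoRA come from the paper; the rank, alpha, dropout, and target modules are illustrative assumptions.

```python
# Minimal sketch: Vicuna-7B-v1.5 + LoRA via transformers/peft.
# All LoRA hyperparameters below are assumptions, not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "lmsys/vicuna-7b-v1.5"  # public checkpoint matching the paper's backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```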
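
Similarly, the experiment-setup row reads as a standard optimizer/scheduler configuration. The sketch below wires the reported values (base learning rate 5e-6, cosine annealing, batch size 32, 3,200 total steps) into a plain PyTorch training loop; the choice of AdamW, the absence of warmup, and the stand-in model and loss are assumptions.

```python
import torch

# Stand-in module; the actual Chat-Scene model is not reproduced here.
model = torch.nn.Linear(8, 8)

# Reported values: base lr 5e-6, cosine annealing, batch size 32, 3200 steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # AdamW is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3200)

batch_size, total_steps = 32, 3200
for step in range(total_steps):
    x = torch.randn(batch_size, 8)   # placeholder batch
    loss = model(x).pow(2).mean()    # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

Setting `T_max` to the total number of steps lets the learning rate decay from 5e-6 toward zero over the whole run, which is consistent with the "cosine annealing schedule" quoted in the setup row.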