Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
Authors: Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments across five representative 3D scene-language datasets, including ScanRefer [4], Multi3DRefer [75], Scan2Cap [12], ScanQA [2], and SQA3D [38]. Our model consistently enhances state-of-the-art performance across all these datasets without fine-tuning on specific tasks. Notably, it surpasses previous methods by 3.7% (Acc@0.5) on ScanRefer, 14.0% (F1@0.5) on Multi3DRefer, 8.7% (CIDEr@0.5) on Scan2Cap, and 7.7% (CIDEr) on ScanQA. |
| Researcher Affiliation | Collaboration | Haifeng Huang (1,2), Yilun Chen (2), Zehan Wang (1), Rongjie Huang (1), Runsen Xu (2), Tai Wang (2), Luping Liu (1), Xize Cheng (1), Yang Zhao (3), Jiangmiao Pang (2), Zhou Zhao (1,2); 1: Zhejiang University, 2: Shanghai AI Laboratory, 3: Bytedance Inc. |
| Pseudocode | No | The paper describes its model architecture and training strategy in text and diagrams but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code has been released at https://github.com/ZzZZCHS/Chat-Scene. |
| Open Datasets | Yes | We conducted experiments on five benchmarks: ScanRefer [4] for single-object visual grounding, Multi3DRefer [75] for multi-object visual grounding, Scan2Cap [12] for dense captioning, and both ScanQA [2] and SQA3D [38] for visual question answering. These benchmarks are based on the ScanNet dataset [16], which comprises richly annotated RGB-D scans of real-world indoor scenes, including both 2D and 3D data across 1,513 scans. |
| Dataset Splits | Yes | All benchmarks adhere to the same train/validation/test splits, facilitating joint training and evaluation. |
| Hardware Specification | Yes | The entire training process takes approximately 8 hours on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like Vicuna-7B-v1.5 and LoRA, and pre-trained encoders like Uni3D [78] and DINOv2 [40], but does not list specific versions for general software dependencies such as Python, PyTorch, or CUDA (see the first sketch below the table). |
| Experiment Setup | Yes | The base learning rate is set to 5e-6 with a cosine annealing schedule, and the batch size is 32. Training runs for 2 epochs, for a total of 3,200 steps (see the training-loop sketch below the table). |
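
To make the software-dependency row concrete, here is a minimal sketch (not the paper's released code) of loading Vicuna-7B-v1.5 with a LoRA adapter via Hugging Face `transformers` and `peft`. Only the backbone name and the use of LoRA come from the paper; the rank, alpha, dropout, and target modules are illustrative assumptions.

```python
# Minimal sketch: Vicuna-7B-v1.5 + LoRA via transformers/peft.
# All LoRA hyperparameters below are assumptions, not the paper's configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "lmsys/vicuna-7b-v1.5"  # public checkpoint matching the paper's backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```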
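
Similarly, the experiment-setup row reads as a standard optimizer/scheduler configuration. The sketch below wires the reported values (base learning rate 5e-6, cosine annealing, batch size 32, 3,200 total steps) into a plain PyTorch training loop; the choice of AdamW, the absence of warmup, and the stand-in model and loss are assumptions.

```python
import torch

# Stand-in module; the actual Chat-Scene model is not reproduced here.
model = torch.nn.Linear(8, 8)

# Reported values: base lr 5e-6, cosine annealing, batch size 32, 3200 steps.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # AdamW is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3200)

batch_size, total_steps = 32, 3200
for step in range(total_steps):
    x = torch.randn(batch_size, 8)   # placeholder batch
    loss = model(x).pow(2).mean()    # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

Setting `T_max` to the total number of steps lets the learning rate decay from 5e-6 toward zero over the whole run, which is consistent with the "cosine annealing schedule" quoted in the setup row.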