Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding

Authors: Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding.
Researcher Affiliation | Academia | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University; 2 Department of Computer Science and Engineering, Hong Kong University of Science and Technology; 3 College of Computer Science and Software Engineering, Shenzhen University; 4 Harbin Institute of Technology, Shenzhen; 5 Research Center of Artificial Intelligence, Peng Cheng Laboratory
Pseudocode | No | The paper describes the methodology in prose and uses diagrams to illustrate concepts, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/iridescentttt/3DVLP.
Open Datasets | Yes | Visual Grounding Dataset: We select the benchmark dataset ScanRefer (Chen, Chang, and Nießner 2020) for the visual grounding task. It consists of 800 3D scenes from the ScanNet dataset (Dai et al. 2017).
Dataset Splits | No | The paper mentions benchmark datasets such as ScanRefer, ScanNet, Scan2Cap, and ScanQA, but does not explicitly detail the training, validation, or test splits used for these datasets.
Hardware Specification | Yes | Codes are implemented by PyTorch and run on an Nvidia 3090 GPU.
Software Dependencies | No | Codes are implemented by PyTorch and run on an Nvidia 3090 GPU. (No version is specified for PyTorch or other dependencies.)
Experiment Setup | Yes | We first train 3DVLP over the proposed proxy tasks including visual grounding, OCC and OSC in the pre-training stage for 200 epochs. We set the batch size as 8 and the initial learning rate is set to be 0.002 for the detector and 5e-4 for other modules in the 3DVLP. (A hedged configuration sketch follows after the table.)
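
The quoted setup specifies only the schedule (200 pre-training epochs, batch size 8) and per-module learning rates (0.002 for the detector, 5e-4 for the remaining modules). Below is a minimal PyTorch sketch of such a two-group optimizer configuration, assuming a model with a `detector` submodule; the module names, the optimizer choice (AdamW), and the placeholder loss are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the quoted pre-training schedule; not 3DVLP's actual code.
import torch
import torch.nn as nn

# Placeholder modules standing in for 3DVLP's detector and remaining components.
class DummyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)

    def forward(self, x):
        return self.backbone(x)

class Dummy3DVLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.detector = DummyDetector()          # quoted lr: 0.002
        self.other_modules = nn.Linear(64, 32)   # quoted lr: 5e-4

    def forward(self, x):
        return self.other_modules(self.detector(x))

model = Dummy3DVLP()

# Two parameter groups mirror the quoted per-module learning rates.
# The optimizer choice (AdamW) is an assumption; the paper excerpt does not name one.
optimizer = torch.optim.AdamW([
    {"params": model.detector.parameters(), "lr": 2e-3},
    {"params": model.other_modules.parameters(), "lr": 5e-4},
])

EPOCHS, BATCH_SIZE = 200, 8  # quoted pre-training schedule
for epoch in range(EPOCHS):
    points = torch.randn(BATCH_SIZE, 128)   # stand-in for a point-cloud feature batch
    loss = model(points).pow(2).mean()      # placeholder for the proxy-task losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Parameter groups let a single optimizer apply different learning rates to the detector and to the other modules, which is one straightforward way to realize the split described in the quoted setup.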