Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
Authors: Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments verify the excellent performance of 3DVLP on three 3D vision-language tasks, reflecting its superiority in semantic 3D scene understanding. |
| Researcher Affiliation | Academia | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University; 2 Department of Computer Science and Engineering, Hong Kong University of Science and Technology; 3 College of Computer Science and Software Engineering, Shenzhen University; 4 Harbin Institute of Technology, Shenzhen; 5 Research Center of Artificial Intelligence, Peng Cheng Laboratory |
| Pseudocode | No | The paper describes the methodology in prose and uses diagrams to illustrate concepts but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/iridescentttt/3DVLP. |
| Open Datasets | Yes | Visual Grounding Dataset: We select the benchmark dataset ScanRefer (Chen, Chang, and Nießner 2020) for visual grounding task. It consists of 800 3D scenes from the ScanNet dataset (Dai et al. 2017). |
| Dataset Splits | No | The paper mentions benchmark datasets like ScanRefer, ScanNet, Scan2Cap, and ScanQA, but does not explicitly detail the training, validation, or test splits used for these datasets. |
| Hardware Specification | Yes | Codes are implemented in PyTorch and run on an NVIDIA 3090 GPU. |
| Software Dependencies | No | Codes are implemented in PyTorch and run on an NVIDIA 3090 GPU. (No version is specified for PyTorch or other dependencies.) |
| Experiment Setup | Yes | We first train 3DVLP over the proposed proxy tasks including visual grounding, OCC and OSC in the pre-training stage for 200 epochs. We set the batch size as 8 and the initial learning rate is set to be 0.002 for the detector and 5e-4 for other modules in the 3DVLP. |
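
The pre-training settings quoted in the Experiment Setup row can be written down as a small configuration sketch. This is an illustration only, not code from the 3DVLP repository: the names `PretrainConfig`, `build_param_groups`, and the `"detector."` name prefix are hypothetical, and the quoted text does not name the optimizer, so none is assumed here. Only the numbers (200 epochs, batch size 8, learning rate 0.002 for the detector and 5e-4 for the other modules) come from the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    epochs: int = 200
    batch_size: int = 8
    detector_lr: float = 2e-3   # initial learning rate for the detector
    other_lr: float = 5e-4      # initial learning rate for all other modules


def build_param_groups(named_params, cfg, detector_prefix="detector."):
    """Split (name, parameter) pairs into two groups with separate learning rates.

    The prefix used to recognise detector parameters is an assumption made
    for this sketch; a real model may name its submodules differently.
    """
    detector, others = [], []
    for name, param in named_params:
        if name.startswith(detector_prefix):
            detector.append(param)
        else:
            others.append(param)
    return [
        {"params": detector, "lr": cfg.detector_lr},
        {"params": others, "lr": cfg.other_lr},
    ]


# Usage with placeholder strings standing in for real parameter tensors.
cfg = PretrainConfig()
groups = build_param_groups(
    [("detector.backbone.weight", "w0"), ("lang_encoder.weight", "w1")],
    cfg,
)
```

The list of dictionaries follows the per-parameter-group layout that PyTorch optimizers accept, so in a real training script the pairs would typically come from `model.named_parameters()` and the result would be passed to whichever optimizer the authors used.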