3D Vision and Language Pretraining with Large-Scale Synthetic Data

Authors: Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering." and "The efficacy of our model is evaluated across 3DVL tasks, including visual grounding (e.g., ScanRefer [Chen et al., 2020], Nr3D/Sr3D [Achlioptas et al., 2020]), dense captioning (e.g., Scan2Cap [Chen et al., 2021]), and question answering (e.g., ScanQA [Azuma et al., 2022])."
Researcher Affiliation | Academia | 1 Wangxuan Institute of Computer Technology, Peking University; 2 National Institute of Health Data Science, Peking University; 3 National Key Laboratory of General Artificial Intelligence, Peking University; 4 State Key Laboratory of General Artificial Intelligence, BIGAI
Pseudocode | No | The paper describes its methods in text and equations but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "Codes are available at: https://github.com/idejie/3DSyn."
Open Datasets | Yes | "To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels" and "The efficacy of our model is evaluated across 3DVL tasks, including visual grounding (e.g., ScanRefer [Chen et al., 2020], Nr3D/Sr3D [Achlioptas et al., 2020]), dense captioning (e.g., Scan2Cap [Chen et al., 2021]), and question answering (e.g., ScanQA [Azuma et al., 2022])."
Dataset Splits | No | The paper evaluates on several benchmarks (e.g., Nr3D, Sr3D, ScanRefer, Scan2Cap, ScanQA) but does not explicitly state training/validation/test splits with percentages or sample counts. It refers to following existing practice (e.g., "we follow [Chen et al., 2020]"), implying standard splits are used, but they are not detailed in the text.
Hardware Specification | Yes | "The pre-training runs for 100 epochs with a batch size of 64 with NVIDIA A100 GPUs."
Software Dependencies | No | The paper mentions using the AdamW [Loshchilov and Hutter, 2019] optimizer but does not provide version numbers for other key software components or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | "The pre-training runs for 100 epochs with a batch size of 64 with NVIDIA A100 GPUs. We set the balance hyper-parameters α, β as 0.5 and 0.8. We use the AdamW [Loshchilov and Hutter, 2019] optimizer and learning rate is set to 1e-4." (A hedged configuration sketch follows below the table.)
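
For concreteness, the reported setup maps onto the following minimal PyTorch sketch. Only the values taken from the row above are grounded in the paper: the AdamW optimizer, learning rate 1e-4, 100 epochs, batch size 64, and balance hyper-parameters α = 0.5 and β = 0.8. The model, the data, and the pairing of α and β with specific loss terms are placeholder assumptions; the quoted text does not specify them, and the actual pipeline lives at https://github.com/idejie/3DSyn.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

# Values reported in the paper's experiment setup.
LR, EPOCHS, BATCH_SIZE = 1e-4, 100, 64
ALPHA, BETA = 0.5, 0.8  # balance hyper-parameters

# Stand-in model and data: the real 3D-VL pre-training model and the
# SynVL3D corpus are NOT reproduced here; this only wires up the
# reported optimizer and schedule settings.
model = nn.Linear(256, 128)
data = TensorDataset(torch.randn(1024, 256), torch.randn(1024, 128))
loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)

optimizer = AdamW(model.parameters(), lr=LR)

for epoch in range(EPOCHS):
    for x, y in loader:
        pred = model(x)
        # Hypothetical loss decomposition: the paper weights two
        # objectives by alpha and beta, but the quoted text does not
        # say which losses they multiply, so these terms are dummies.
        loss_main = nn.functional.mse_loss(pred, y)
        loss_aux1 = pred.abs().mean()   # placeholder auxiliary term
        loss_aux2 = pred.pow(2).mean()  # placeholder auxiliary term
        loss = loss_main + ALPHA * loss_aux1 + BETA * loss_aux2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Swapping the stand-in model, dataset, and auxiliary terms for the ones in the released repository would recover the paper's actual pre-training run; the sketch only demonstrates how the reported hyper-parameters fit together.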