3D Vision and Language Pretraining with Large-Scale Synthetic Data
Authors: Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering." and "The efficacy of our model is evaluated across 3DVL tasks, including visual grounding (e.g., ScanRefer [Chen et al., 2020], Nr3D/Sr3D [Achlioptas et al., 2020]), dense captioning (e.g., Scan2Cap [Chen et al., 2021]), and question answering (e.g., ScanQA [Azuma et al., 2022])." |
| Researcher Affiliation | Academia | 1 Wangxuan Institute of Computer Technology, Peking University; 2 National Institute of Health Data Science, Peking University; 3 National Key Laboratory of General Artificial Intelligence, Peking University; 4 State Key Laboratory of General Artificial Intelligence, BIGAI |
| Pseudocode | No | The paper describes its methods in text and equations but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Codes are available at: https://github.com/idejie/3DSyn. |
| Open Datasets | Yes | "To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at object, view, and room levels" and "The efficacy of our model is evaluated across 3DVL tasks, including visual grounding (e.g., ScanRefer [Chen et al., 2020], Nr3D/Sr3D [Achlioptas et al., 2020]), dense captioning (e.g., Scan2Cap [Chen et al., 2021]), and question answering (e.g., ScanQA [Azuma et al., 2022])." |
| Dataset Splits | No | The paper mentions evaluating on various benchmarks (e.g., Nr3D, Sr3D, ScanRefer, Scan2Cap, ScanQA) but does not explicitly state the training/validation/test splits with percentages or sample counts for these datasets within the provided text. It refers to following existing practices for evaluation (e.g., "we follow [Chen et al., 2020]"), implying that standard splits are used, but they are not detailed here. |
| Hardware Specification | Yes | The pre-training runs for 100 epochs with a batch size of 64 with NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW [Loshchilov and Hutter, 2019] optimizer but does not provide specific version numbers for other key software components or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | The pre-training runs for 100 epochs with a batch size of 64 with NVIDIA A100 GPUs. We set the balance hyper-parameters α, β as 0.5 and 0.8. We use the AdamW [Loshchilov and Hutter, 2019] optimizer and the learning rate is set to 1e-4. (A minimal configuration sketch follows this table.) |
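
For reference, the reported pre-training configuration (100 epochs, batch size 64, AdamW with learning rate 1e-4, loss-balance weights α = 0.5 and β = 0.8) can be expressed as a minimal PyTorch sketch. The model, auxiliary losses, and dataset below are placeholders introduced for illustration; only the hyper-parameter values come from the paper. The authors' actual implementation is in the linked repository (https://github.com/idejie/3DSyn).

```python
# Minimal sketch of the reported pre-training setup (assumed PyTorch).
# The model and loss terms are placeholders; only the hyper-parameters
# (epochs, batch size, AdamW, lr, alpha, beta) are taken from the paper.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

ALPHA, BETA = 0.5, 0.8            # balance hyper-parameters reported in the paper
EPOCHS, BATCH_SIZE, LR = 100, 64, 1e-4

model = nn.Linear(256, 128)       # stand-in for the 3D vision-language pretraining model
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# Dummy tensors standing in for the SynVL3D scene-text corpus.
dataset = TensorDataset(torch.randn(1024, 256), torch.randn(1024, 128))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for feats, targets in loader:
        preds = model(feats)
        # Illustrative composite objective: a main loss plus two auxiliary
        # terms weighted by alpha and beta (the actual pretraining losses
        # in the paper differ; this only shows how the weights enter).
        main_loss = nn.functional.mse_loss(preds, targets)
        aux_a = preds.pow(2).mean()
        aux_b = preds.abs().mean()
        loss = main_loss + ALPHA * aux_a + BETA * aux_b
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```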