Language-Assisted 3D Feature Learning for Semantic Scene Understanding

Authors: Junbo Zhang, Guofan Fan, Guanghan Wang, Zhengyuan Su, Kaisheng Ma, Li Yi

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning.
Researcher Affiliation | Collaboration | Junbo Zhang1, Guofan Fan2, Guanghan Wang1, Zhengyuan Su1, Kaisheng Ma1, Li Yi1,3,4* (1Tsinghua University, 2Xi'an Jiaotong University, 3Shanghai Artificial Intelligence Laboratory, 4Shanghai Qi Zhi Institute)
Pseudocode | No | The paper describes the method verbally and with diagrams but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/Asterisci/Language-Assisted-3D.
Open Datasets | Yes | We adopt the ScanNet V2 (Dai et al. 2017) dataset for 3D detection and instance segmentation. ScanNet V2 provides 1,513 indoor scans with semantic and instance segmentation annotations for 18 object categories. We perform language-assisted training based on a widely used point-cloud-based visual grounding dataset, ScanRefer (Chen, Chang, and Nießner 2020).
Dataset Splits | Yes | We follow the ScanRefer benchmark to split the train/val/test set with 36,655, 9,508, and 5,410 samples, respectively.
Hardware Specification | No | The paper discusses models, optimizers, and training parameters but does not specify the hardware (e.g., GPU or CPU models) used for the experiments.
Software Dependencies | No | The paper mentions several models and components, such as GloVe and GRU, but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | For language-assisted training, we jointly train the 3D perception task and the auxiliary tasks for 70 epochs with an Adam optimizer and an initial learning rate of 0.001. The learning rate is multiplied by 0.3 after 50 and 60 epochs. We set α, β, and γ to 0.1, 0.05, and 0.05. VoteNet and MLCVNet are trained with a batch size of 12, in which each scene is paired with one description. PointGroup is trained with a batch size of 4, in which each scene is paired with 8 descriptions to balance the training time.
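
To make the reported experiment setup concrete, below is a minimal PyTorch sketch of the optimization schedule described in the row above. The model, data, and the three auxiliary loss terms are placeholders, since the section only specifies the optimizer, the learning-rate schedule, the loss weights α, β, and γ, the epoch count, and the batch sizes; this is not the authors' implementation.

```python
import torch
from torch import nn

# Stand-in for the 3D perception network (VoteNet, MLCVNet, or PointGroup in the paper).
model = nn.Linear(3, 18)

# Adam with an initial learning rate of 0.001, as reported.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Learning rate multiplied by 0.3 after epochs 50 and 60; 70 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 60], gamma=0.3)

# Reported weights for the auxiliary (language-assisted) loss terms.
alpha, beta, gamma = 0.1, 0.05, 0.05

for epoch in range(70):
    # One dummy batch per epoch; in practice this would iterate over the
    # ScanNet/ScanRefer loader (batch size 12 for VoteNet/MLCVNet, 4 for PointGroup).
    points = torch.randn(12, 3)
    logits = model(points)

    # Placeholder losses: the 3D perception loss plus three auxiliary terms.
    loss_perception = logits.pow(2).mean()
    loss_aux1 = loss_aux2 = loss_aux3 = logits.abs().mean()
    loss = loss_perception + alpha * loss_aux1 + beta * loss_aux2 + gamma * loss_aux3

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # step the learning-rate schedule once per epoch
```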