Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

Authors: Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate our method on multiple datasets, including SUN RGB-D [75] and ScanNet [14] for 3D object detection and S3DIS [5] for 3D semantic segmentation. Our approach outperforms state-of-the-art self-supervised learning methods in both tasks, demonstrating the effectiveness of our proposed framework. |
| Researcher Affiliation | Academia | Zhimin Chen (Clemson University, zhiminc@clemson.edu); Longlong Jing (The City University of New York, ljing@gradcenter.cuny.edu); Yingwei Li (Johns Hopkins University, yingwei.li@jhu.edu); Bing Li (Clemson University, bli4@clemson.edu) |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Code will be available at: https://github.com/Zhimin-C/Bridge3D |
| Open Datasets | Yes | We evaluate our method on multiple datasets, including SUN RGB-D [75] and ScanNet [14] for 3D object detection and S3DIS [5] for 3D semantic segmentation. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages, sample counts, or references to predefined splits) needed for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details, such as GPU or CPU models, processor types, or memory amounts, used for running the experiments. |
| Software Dependencies | No | The paper names the models and optimizer used (e.g., Point-MAE, DINOv2, CLIP ViT-B, Tag2Text, AdamW) but does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | Pre-training. During this stage, we train the model for 120 epochs on the ScanNet dataset [14]... We use the AdamW [45] optimizer with a base learning rate of 5e-4 and a weight decay of 5e-2, along with a batch size of 64. The whole masking ratio r_w is set to 70% and the drop ratio r_d to 40%. A cosine learning-rate scheduler is applied, with the drop path rate and warm-up epochs set to 0.1 and 10, respectively. The encoder depth is set to 6, and we use the same decoder as Point-MAE [50], with the decoder depth set to 2. |
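The reported pre-training configuration can be collected into a small sketch. The hyperparameter values below come from the paper's setup; the exact schedule shape (linear warm-up followed by cosine decay to zero) is an assumption based on the stated "cosine learning rate scheduler" with 10 warm-up epochs, as the paper does not spell out the warm-up form.

```python
import math

# Hyperparameters as reported in the paper's pre-training setup.
BASE_LR = 5e-4        # AdamW base learning rate
WEIGHT_DECAY = 5e-2   # AdamW weight decay
BATCH_SIZE = 64
EPOCHS = 120
WARMUP_EPOCHS = 10
MASK_RATIO = 0.70     # whole masking ratio r_w
DROP_RATIO = 0.40     # drop ratio r_d
DROP_PATH_RATE = 0.1

def lr_at_epoch(epoch: int, base_lr: float = BASE_LR,
                warmup: int = WARMUP_EPOCHS, total: int = EPOCHS) -> float:
    """Linear warm-up then cosine decay (assumed schedule shape;
    the paper states only 'cosine scheduler, 10 warm-up epochs')."""
    if epoch < warmup:
        # Ramp linearly from base_lr/warmup up to base_lr.
        return base_lr * (epoch + 1) / warmup
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these values, the learning rate peaks at 5e-4 at the end of warm-up (epoch 10) and decays smoothly toward zero by epoch 120.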