SAM-Guided Masked Token Prediction for 3D Scene Understanding

Authors: Zhimin Chen, Liang Yang, Yingwei Li, Longlong Jing, Bing Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current state-of-the-art self-supervised methods, establishing new benchmarks in this field.
Researcher Affiliation | Academia | Zhimin Chen (Clemson University), Liang Yang (The City University of New York), Yingwei Li (Johns Hopkins University), Longlong Jing (The City University of New York), Bing Li (Clemson University)
Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Code will be released upon acceptance.
Open Datasets | Yes | We validated our methodology across multiple datasets and tasks, including SUN RGB-D [46] and ScanNet [14] for 3D object detection, and S3DIS [4] and ScanNet [14] for 3D semantic segmentation.
Dataset Splits | Yes | We follow the official protocol for training/validation splits, extracting 78,000 frames from the training subset by sampling one frame every 25 frames to construct our dataset.
Hardware Specification | Yes | The training is conducted using four A100 GPUs.
Software Dependencies | No | The paper mentions specific models like DINOv2 ViT-B but does not provide version numbers for any software libraries or programming languages used for implementation.
Experiment Setup | Yes | For the optimization process, we utilize the AdamW optimizer [34] throughout both stages of our training, starting with a base learning rate of 0.001 and a weight decay set at 0.05. Our data is processed in batches of 64. During the second stage of training, we increase the masking ratio (rw) to 60%. To further enhance the training dynamics, we implement a cosine learning rate scheduler coupled with a drop path rate of 0.1 and include a warm-up phase of 10 epochs to facilitate a smooth adjustment to the training conditions.
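
The learning-rate schedule described in the Experiment Setup entry (base rate 0.001, 10 warm-up epochs, cosine decay) can be illustrated with a minimal sketch. This is not the authors' code, which has not been released. The total epoch count, the linear shape of the warm-up, and the decay floor of zero are not stated above; `TOTAL_EPOCHS`, `MIN_LR`, and the function name `lr_at_epoch` are hypothetical choices made only for illustration.

```python
import math

BASE_LR = 1e-3       # reported base learning rate
WARMUP_EPOCHS = 10   # reported warm-up length
TOTAL_EPOCHS = 300   # hypothetical: the total epoch count is not reported above
MIN_LR = 0.0         # assumption: the cosine curve decays to zero


def lr_at_epoch(epoch, base_lr=BASE_LR, warmup_epochs=WARMUP_EPOCHS,
                total_epochs=TOTAL_EPOCHS, min_lr=MIN_LR):
    """Return the learning rate for a 0-indexed epoch.

    Assumes a linear warm-up that reaches base_lr on the last warm-up
    epoch, followed by cosine decay from base_lr down to min_lr.
    """
    if epoch < warmup_epochs:
        # Linear ramp: base_lr / warmup_epochs on epoch 0, base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 9, 10, 155, 300):
        print(e, lr_at_epoch(e))
```

In a real training loop, the value returned for each epoch would be written into the optimizer's learning rate; the reported weight decay of 0.05 and batch size of 64 are optimizer and data-loader settings and do not affect this schedule.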