SAM-Guided Masked Token Prediction for 3D Scene Understanding
Authors: Zhimin Chen, Liang Yang, Yingwei Li, Longlong Jing, Bing Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our methodology has been validated across multiple datasets, including SUN RGB-D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current state-of-the-art self-supervised methods, establishing new benchmarks in this field. |
| Researcher Affiliation | Academia | Zhimin Chen (Clemson University), Liang Yang (The City University of New York), Yingwei Li (Johns Hopkins University), Longlong Jing (The City University of New York), Bing Li (Clemson University) |
| Pseudocode | No | The paper describes methods and processes in narrative text and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be released upon acceptance. |
| Open Datasets | Yes | We validated our methodology across multiple datasets and tasks, including SUN RGB-D [46] and ScanNet [14] for 3D object detection, and S3DIS [4] and ScanNet [14] for 3D semantic segmentation. |
| Dataset Splits | Yes | We follow the official protocol for training/validation splits, extracting 78,000 frames from the training subset by sampling one frame every 25 frames to construct our dataset. (A minimal sampling sketch follows the table.) |
| Hardware Specification | Yes | The training is conducted using four A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models like DINOv2 ViT-B but does not provide version numbers for any software libraries or programming languages used for implementation. |
| Experiment Setup | Yes | For the optimization process, we utilize the AdamW optimizer [34] throughout both stages of our training, starting with a base learning rate of 0.001 and a weight decay set at 0.05. Our data is processed in batches of 64. During the second stage of training, we increase the masking ratio (r_w) to 60%. To further enhance the training dynamics, we implement a cosine learning rate scheduler coupled with a drop path rate of 0.1 and include a warm-up phase of 10 epochs to facilitate a smooth adjustment to the training conditions. (A hedged configuration sketch follows the table.) |
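
The frame-extraction step reported under Dataset Splits is simple enough to reconstruct. Below is a minimal sketch, assuming each ScanNet training scene stores its RGB frames as numbered image files; the directory layout and the helper name `sample_frames` are hypothetical, since the paper's code is not released.

```python
from pathlib import Path

SAMPLE_EVERY = 25  # keep one frame out of every 25, per the paper

def sample_frames(scene_dir: Path) -> list[Path]:
    """Return every 25th frame from one scene's frame directory."""
    frames = sorted(scene_dir.glob("*.jpg"))  # frame files, in temporal order
    return frames[::SAMPLE_EVERY]

# Hypothetical usage over the official training split:
# all_frames = [f for scene in Path("scannet/train").iterdir()
#               for f in sample_frames(scene)]
```

Applied to the official ScanNet training split, this 1-in-25 sampling is what the authors report yields roughly 78,000 frames.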
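
The Experiment Setup row pins down most of the optimizer configuration. The sketch below wires the reported values (AdamW, base learning rate 0.001, weight decay 0.05, batch size 64, 10-epoch warm-up, cosine decay, drop path 0.1, second-stage masking ratio 60%) into standard PyTorch. The total epoch count and the stand-in model are assumptions, as the paper's code was not available at submission time.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Values reported in the paper.
BASE_LR = 1e-3
WEIGHT_DECAY = 0.05
BATCH_SIZE = 64
WARMUP_EPOCHS = 10
MASK_RATIO_STAGE2 = 0.60  # r_w in the paper
DROP_PATH_RATE = 0.1      # applied inside the backbone (not shown here)
TOTAL_EPOCHS = 100        # assumption: not stated in the quoted setup

model = torch.nn.Linear(8, 8)  # stand-in for the (unreleased) model
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up for 10 epochs, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
```

Per-epoch scheduling with a `LambdaLR` is one common way to realize "cosine schedule with warm-up"; the authors may equally have stepped the schedule per iteration.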