Towards Label-free Scene Understanding by Vision Foundation Models

Authors: Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, Wenping Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments.
Researcher Affiliation | Academia | Runnan Chen (1), Youquan Liu (2), Lingdong Kong (3), Nenglun Chen (1), Xinge Zhu (4), Yuexin Ma (5), Tongliang Liu (6), Wenping Wang (7); affiliations: 1 The University of Hong Kong, 2 Hochschule Bremerhaven, 3 National University of Singapore, 4 The Chinese University of Hong Kong, 5 ShanghaiTech University, 6 The University of Sydney, 7 Texas A&M University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available (footnote 2: https://github.com/runnanchen/Label-Free-Scene-Understanding).
Open Datasets | Yes | To evaluate the superior performance and generalization capability of our method for scene understanding, we conduct experiments on both indoor and outdoor public datasets, namely ScanNet [91], nuScenes [92], and nuImages [93].
Dataset Splits | Yes | ScanNet [91] consists of 1,603 indoor scans, collected by RGB-D camera, with 20 classes, where 1,201 scans are allocated for training, 312 scans for validation, and 100 scans for testing. Additionally, we utilize 25,000 key frame images to train the 2D network. The nuScenes [92] dataset, collected in traffic scenarios by LiDAR and RGB camera, comprises 700 scenes for training, 150 for validation, and 150 for testing, focusing on LiDAR semantic segmentation with 16 classes. To be more specific, we leverage a total of 24,109 sweeps of LiDAR scans for training and 4,021 sweeps for validation. Each sweep is accompanied by six camera images, providing a comprehensive 360-degree view. The nuImages [93] dataset provides 93,000 2D annotated images sourced from a significantly larger dataset. This includes 67,279 images for training, 16,445 for validation, and 9,752 for testing. (An illustrative summary of these splits follows the table.)
Hardware Specification | Yes | Our framework is developed using PyTorch and trained on two NVIDIA Tesla A100 GPUs. (A generic two-GPU launch sketch follows the table.)
Software Dependencies | No | The paper mentions "PyTorch" but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | During training, both CLIP and SAM are kept frozen. For prediction consistency regularization, we transition to stage two after ten epochs of stage one. To enhance the robustness of our model, we apply various data augmentations, such as random rotation along the z-axis and random flip for point clouds, as well as random horizontal flip, random crop, and random resize for images. For the ScanNet dataset, the training process takes approximately 10 hours for 30 epochs, with the image number set to 16. In the case of the nuScenes dataset, the training time is 40 hours for 20 epochs, with the image number set to 6. (A sketch of this setup closes the section.)
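For quick reference, the split statistics quoted in the Dataset Splits row can be collected into a small configuration dictionary. This is only an illustrative summary of the numbers reported in the paper; the key names and structure are assumptions, not taken from the released code.

```python
# Illustrative summary of the dataset splits reported in the paper.
# Key names and structure are assumptions, not from the official repository.
DATASET_SPLITS = {
    "scannet": {
        "scans": {"train": 1201, "val": 312, "test": 100},
        "num_classes": 20,
        "keyframe_images_for_2d_training": 25000,
    },
    "nuscenes": {
        "scenes": {"train": 700, "val": 150, "test": 150},
        "lidar_sweeps": {"train": 24109, "val": 4021},
        "num_classes": 16,
        "cameras_per_sweep": 6,
    },
    "nuimages": {
        "images": {"train": 67279, "val": 16445, "test": 9752},
    },
}
```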
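The paper reports the hardware (PyTorch, two NVIDIA Tesla A100 GPUs) but not how multi-GPU training is launched. Below is a minimal, generic PyTorch DistributedDataParallel sketch of a two-GPU launch; the launch command, script name, and placeholder model are assumptions rather than details from the released repository.

```python
# Minimal two-GPU DistributedDataParallel launch sketch (illustrative only;
# the real entry point, model, and data pipeline live in the released code).
# Launch with: torchrun --nproc_per_node=2 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")       # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 20).cuda(local_rank)  # placeholder network
    model = DDP(model, device_ids=[local_rank])
    # ... build DistributedSampler-backed dataloaders and run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()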
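Finally, the augmentations and two-stage schedule described in the Experiment Setup row can be sketched as follows. This is a minimal illustration assuming standard torchvision/NumPy implementations; crop size, flip axis, and probabilities are assumptions where the paper does not state them, and the frozen CLIP/SAM encoders and the per-epoch training step are left as placeholders.

```python
# Illustrative sketch of the experiment setup described above; values not
# stated in the paper (crop size, flip axis, probabilities) are assumptions.
import numpy as np
import torchvision.transforms as T

# Image augmentations: random horizontal flip plus random crop/resize.
image_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomResizedCrop(size=(224, 224), scale=(0.5, 1.0)),  # crop size assumed
])

def augment_points(points: np.ndarray) -> np.ndarray:
    """Random rotation around the z-axis and random flip for a point cloud (N, 3+)."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]], dtype=points.dtype)
    points = points.copy()
    points[:, :3] = points[:, :3] @ rot.T
    if np.random.rand() < 0.5:            # flip axis and probability assumed
        points[:, 0] = -points[:, 0]
    return points

# Two-stage schedule: stage two (prediction consistency regularization)
# starts after ten epochs of stage one; CLIP and SAM stay frozen throughout.
STAGE_ONE_EPOCHS = 10
TOTAL_EPOCHS = 30  # ScanNet setting; 20 epochs are used for nuScenes

for epoch in range(TOTAL_EPOCHS):
    stage = 1 if epoch < STAGE_ONE_EPOCHS else 2
    # train_one_epoch(stage=stage, image_aug=image_aug, point_aug=augment_points)
```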