Towards Label-free Scene Understanding by Vision Foundation Models
Authors: Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, Wenping Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. |
| Researcher Affiliation | Academia | Runnan Chen¹, Youquan Liu², Lingdong Kong³, Nenglun Chen¹, Xinge Zhu⁴, Yuexin Ma⁵, Tongliang Liu⁶, Wenping Wang⁷; ¹The University of Hong Kong, ²Hochschule Bremerhaven, ³National University of Singapore, ⁴The Chinese University of Hong Kong, ⁵ShanghaiTech University, ⁶The University of Sydney, ⁷Texas A&M University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available (footnote 2: https://github.com/runnanchen/Label-Free-Scene-Understanding). |
| Open Datasets | Yes | To evaluate the superior performance and generalization capability of our method for scene understanding, we conduct experiments on both indoor and outdoor public datasets, namely ScanNet [91], nuScenes [92], and nuImages [93] |
| Dataset Splits | Yes | ScanNet [91] consists of 1,613 indoor scans, collected by RGB-D camera, with 20 classes, where 1,201 scans are allocated for training, 312 scans for validation, and 100 scans for testing. Additionally, we utilize 25,000 key frame images to train the 2D network. The nuScenes [92] dataset, collected in traffic scenarios by LiDAR and RGB camera, comprises 700 scenes for training, 150 for validation, and 150 for testing, focusing on LiDAR semantic segmentation with 16 classes. To be more specific, we leverage a total of 24,109 sweeps of LiDAR scans for training and 4,021 sweeps for validation. Each sweep is accompanied by six camera images, providing a comprehensive 360-degree view. The nuImages [93] dataset provides 93,000 2D annotated images sourced from a significantly larger dataset. This includes 67,279 images for training, 16,445 for validation, and 9,752 for testing. |
| Hardware Specification | Yes | Our framework is developed using PyTorch and trained on two NVIDIA Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions "PyTorch" but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | During training, both CLIP and SAM are kept frozen. For prediction consistency regularization, we transition to stage two after ten epochs of stage one. To enhance the robustness of our model, we apply various data augmentations, such as random rotation along the z-axis and random flip for point clouds, as well as random horizontal flip, random crop, and random resize for images. For the ScanNet dataset, the training process takes approximately 10 hours for 30 epochs, with the image number set to 16. In the case of the nuScenes dataset, the training time is 40 hours for 20 epochs, with the image number set to 6. (A hedged sketch of this setup follows the table.) |
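For readers who want to mirror the reported setup, the snippet below is a minimal PyTorch-style sketch of the quoted augmentation and freezing scheme: random z-axis rotation and random flips for point clouds, random horizontal flip, crop, and resize for images, and frozen CLIP/SAM backbones. Function names, crop sizes, and flip probabilities are illustrative assumptions, not the authors' code; the official implementation is at https://github.com/runnanchen/Label-Free-Scene-Understanding.

```python
# Sketch of the augmentations and freezing described in the "Experiment Setup" row.
# Names, sizes, and probabilities are assumptions for illustration only.
import numpy as np
import torch
import torchvision.transforms as T


def augment_point_cloud(points: np.ndarray) -> np.ndarray:
    """Random rotation about the z-axis plus random flips for (N, 3) xyz points."""
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    rot_z = np.array(
        [
            [np.cos(theta), -np.sin(theta), 0.0],
            [np.sin(theta), np.cos(theta), 0.0],
            [0.0, 0.0, 1.0],
        ],
        dtype=points.dtype,
    )
    points = points @ rot_z.T
    if np.random.rand() < 0.5:  # random flip along x
        points[:, 0] = -points[:, 0]
    if np.random.rand() < 0.5:  # random flip along y
        points[:, 1] = -points[:, 1]
    return points


# Image-side augmentations: random horizontal flip, random resize, random crop.
# The target size is an assumption; the paper does not report it in this excerpt.
image_transform = T.Compose(
    [
        T.RandomHorizontalFlip(p=0.5),
        T.RandomResizedCrop(size=(224, 416), scale=(0.5, 1.0)),
        T.ToTensor(),
    ]
)


def freeze(module: torch.nn.Module) -> torch.nn.Module:
    """CLIP and SAM are kept frozen during training, per the quoted setup."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()
```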