FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection
Authors: Dongmei Zhang, Chang Li, Renrui Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git. Experiments conducted on two public and commonly used open-vocabulary 3D point-cloud object detection datasets achieve state-of-the-art performances, demonstrating that our method is effective. |
| Researcher Affiliation | Academia | (1) National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; (2) The Chinese University of Hong Kong; (3) Wuhan University; (4) Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is released at https://github.com/dmzhang0425/FM-OV3D.git. |
| Open Datasets | Yes | We conduct experiments on public, widely used datasets SUN RGB-D (Song, Lichtenberg, and Xiao 2015) and ScanNet (Dai et al. 2017) in 3D object detection tasks. The provided point-cloud data and corresponding images, together with the matrix K, are used in our method. We adopt LVIS (Gupta, Dollar, and Girshick 2019) as our training vocabulary. |
| Dataset Splits | No | The paper mentions training and testing phases and the datasets used, but does not explicitly provide percentages or counts for training, validation, and test splits, nor does it refer to specific standard splits with these details. |
| Hardware Specification | Yes | Experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs and NVIDIA A100 SXM4 80GB GPUs. |
| Software Dependencies | Yes | We apply the stable-diffusion-v1-4 model guided by GPT-3-generated prompts. CLIP version ViT-B/32 is used for extracting features. |
| Experiment Setup | Yes | We train our model for 400 epochs. The base learning rate is set to 7e-4. We load 8 3D scenes onto each GPU in every batch. We adopt 3DETR (Misra, Girdhar, and Joulin 2021) as the 3D detector. A hedged reproduction sketch based on these reported values follows the table. |
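
The Software Dependencies and Experiment Setup rows name concrete components: CLIP ViT-B/32 for feature extraction, 400 training epochs, a base learning rate of 7e-4, 8 scenes per GPU, and 3DETR as the detector. The snippet below is a minimal reproduction sketch of the CLIP text-encoding step plus the reported hyperparameters, assuming the official openai/clip package. The prompt template, the placeholder class names, and the `config` dictionary are illustrative assumptions, not the authors' released code (see the linked repository for the actual implementation).

```python
import torch
import clip  # official openai/clip package

# Load the CLIP variant reported in the paper (ViT-B/32).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode open-vocabulary class names (e.g. drawn from the LVIS vocabulary)
# into text embeddings that a 3D detector head could be aligned against.
# The class list and prompt template here are placeholders.
class_names = ["chair", "sofa", "bookshelf"]
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Hyperparameters as reported in the Experiment Setup row; the dictionary
# layout is an assumption for illustration only.
config = {
    "epochs": 400,
    "base_learning_rate": 7e-4,
    "scenes_per_gpu": 8,
    "detector": "3DETR",
}
print(text_features.shape, config)
```
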