FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection

Authors: Dongmei Zhang, Chang Li, Renrui Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang

Venue: AAAI 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git. Experiments conducted on two public and commonly used open-vocabulary 3D point-cloud object detection datasets achieve state-of-the-art performances, demonstrating that our method is effective.

Researcher Affiliation | Academia | 1. National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; 2. The Chinese University of Hong Kong; 3. Wuhan University; 4. Hong Kong University of Science and Technology

Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.

Open Source Code | Yes | Code is released at https://github.com/dmzhang0425/FM-OV3D.git.

Open Datasets | Yes | We conduct experiments on public, widely used datasets SUN RGB-D (Song, Lichtenberg, and Xiao 2015) and ScanNet (Dai et al. 2017) in 3D object detection tasks. The provided point-cloud data and corresponding images, together with the matrix K, are used in our method. We adopt LVIS (Gupta, Dollar, and Girshick 2019) as our training vocabulary.

Dataset Splits | No | The paper mentions training and testing phases and datasets, but does not explicitly provide percentages or counts for training, validation, and test splits, nor does it reference specific standard splits with these details.

Hardware Specification | Yes | Experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs and A100 SXM4 80GB GPUs.
Software Dependencies | Yes | We apply the stable-diffusion-v1-4 model commanded by GPT-3 generated prompts. CLIP version ViT-B/32 is used for extracting features. (A loading sketch for these dependencies follows the table.)

Experiment Setup | Yes | We train our model in 400 epochs. The base learning rate is set to 7e-4. We load 8 3D scenes onto each GPU in every batch. We adopt 3DETR (Misra, Girdhar, and Joulin 2021) as the 3D detector. (A configuration sketch with these values follows the table.)
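
The software-dependencies row maps onto standard open-source packages. Below is a minimal loading sketch, assuming the openai `clip` package and Hugging Face `diffusers`; the text prompt is a placeholder rather than one of the paper's GPT-3-generated prompts, which are not reproduced here.

```python
# Minimal loading sketch for the reported dependency versions (CLIP ViT-B/32,
# stable-diffusion-v1-4). The prompt string is a hypothetical stand-in for the
# GPT-3-generated prompts described in the paper.
import torch
import clip                                    # openai CLIP package
from diffusers import StableDiffusionPipeline  # Hugging Face diffusers

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32 for extracting text (and image) features.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a photo of a chair"]).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(tokens)  # shape (1, 512)

# stable-diffusion-v1-4, driven here by a placeholder prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"
).to(device)
prompt = "a realistic indoor scene with a wooden chair"  # placeholder, not a paper prompt
image = pipe(prompt).images[0]
image.save("generated_scene.png")
```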
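The experiment-setup row can likewise be read as a small training configuration. The sketch below only echoes the reported values (400 epochs, base learning rate 7e-4, 8 scenes per GPU); the stub module standing in for the 3DETR detector and the choice of optimizer are assumptions, since the released FM-OV3D repository defines the actual model, data loading, and training loop.

```python
# Sketch of the reported training configuration. The Linear module is a stub
# standing in for the 3DETR detector (Misra, Girdhar, and Joulin 2021), and the
# optimizer choice is an assumption; see the released repository for the real code.
import torch
import torch.nn as nn

EPOCHS = 400        # reported: trained for 400 epochs
BASE_LR = 7e-4      # reported base learning rate
SCENES_PER_GPU = 8  # reported: 8 3D scenes loaded onto each GPU per batch

detector = nn.Linear(3, 3)  # placeholder for the 3DETR detector
optimizer = torch.optim.AdamW(detector.parameters(), lr=BASE_LR)  # optimizer assumed

for epoch in range(EPOCHS):
    # A real epoch would iterate over SUN RGB-D / ScanNet scenes in batches of
    # SCENES_PER_GPU; random points keep this sketch self-contained and runnable.
    points = torch.randn(SCENES_PER_GPU, 1024, 3)
    loss = detector(points).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```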