FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection

Authors: Dongmei Zhang, Chang Li, Renrui Zhang, Shenghao Xie, Wei Xue, Xiaodong Xie, Shanghang Zhang

Venue: AAAI 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git. Experiments conducted on two public and commonly used open-vocabulary 3D point-cloud object detection datasets achieve state-of-the-art performances, demonstrating that our method is effective.

Researcher Affiliation | Academia | 1. National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; 2. The Chinese University of Hong Kong; 3. Wuhan University; 4. Hong Kong University of Science and Technology

Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.

Open Source Code | Yes | Code is released at https://github.com/dmzhang0425/FM-OV3D.git.

Open Datasets | Yes | We conduct experiments on public, widely used datasets SUN RGB-D (Song, Lichtenberg, and Xiao 2015) and ScanNet (Dai et al. 2017) in 3D object detection tasks. The provided point-cloud data and corresponding images, together with the matrix K, are used in our method. We adopt LVIS (Gupta, Dollar, and Girshick 2019) as our training vocabulary.

Dataset Splits | No | The paper mentions training and testing phases and datasets, but does not explicitly provide percentages or counts for training, validation, and test splits, nor does it reference specific standard splits with these details.

Hardware Specification | Yes | Experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs and A100 SXM4 80GB GPUs.
Software Dependencies | Yes | We apply the stable-diffusion-v1-4 model commanded by GPT-3 generated prompts. CLIP version ViT-B/32 is used for extracting features. (A loading sketch for these dependencies follows the table.)

Experiment Setup | Yes | We train our model in 400 epochs. The base learning rate is set to 7e-4. We load 8 3D scenes onto each GPU in every batch. We adopt 3DETR (Misra, Girdhar, and Joulin 2021) as the 3D detector. (A configuration sketch with these values follows the table.)
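
The software-dependencies row maps onto standard open-source packages. Below is a minimal loading sketch, assuming the openai `clip` package and Hugging Face `diffusers`; the text prompt is a placeholder rather than one of the paper's GPT-3-generated prompts, which are not reproduced here.

```python
# Minimal loading sketch for the reported dependency versions (CLIP ViT-B/32,
# stable-diffusion-v1-4). The prompt string is a hypothetical stand-in for the
# GPT-3-generated prompts described in the paper.
import torch
import clip                                    # openai CLIP package
from diffusers import StableDiffusionPipeline  # Hugging Face diffusers

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32 for extracting text (and image) features.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a photo of a chair"]).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(tokens)  # shape (1, 512)

# stable-diffusion-v1-4, driven here by a placeholder prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4"
).to(device)
prompt = "a realistic indoor scene with a wooden chair"  # placeholder, not a paper prompt
image = pipe(prompt).images[0]
image.save("generated_scene.png")
```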
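The experiment-setup row can likewise be read as a small training configuration. The sketch below only echoes the reported values (400 epochs, base learning rate 7e-4, 8 scenes per GPU); the stub module standing in for the 3DETR detector and the choice of optimizer are assumptions, since the released FM-OV3D repository defines the actual model, data loading, and training loop.

```python
# Sketch of the reported training configuration. The Linear module is a stub
# standing in for the 3DETR detector (Misra, Girdhar, and Joulin 2021), and the
# optimizer choice is an assumption; see the released repository for the real code.
import torch
import torch.nn as nn

EPOCHS = 400        # reported: trained for 400 epochs
BASE_LR = 7e-4      # reported base learning rate
SCENES_PER_GPU = 8  # reported: 8 3D scenes loaded onto each GPU per batch

detector = nn.Linear(3, 3)  # placeholder for the 3DETR detector
optimizer = torch.optim.AdamW(detector.parameters(), lr=BASE_LR)  # optimizer assumed

for epoch in range(EPOCHS):
    # A real epoch would iterate over SUN RGB-D / ScanNet scenes in batches of
    # SCENES_PER_GPU; random points keep this sketch self-contained and runnable.
    points = torch.randn(SCENES_PER_GPU, 1024, 3)
    loss = detector(points).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```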