Multi-modal Queried Object Detection in the Wild

Authors: Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by +6.3% AP on average across 13 few-shot downstream tasks, with merely 3% additional modulating time on top of GLIP.
Researcher Affiliation | Collaboration | Yifan Xu (1,3), Mengdan Zhang (2), Chaoyou Fu (2), Peixian Chen (2), Xiaoshan Yang (1,3,4), Ke Li (2), Changsheng Xu (1,3,4). 1: MAIS, Institute of Automation, Chinese Academy of Sciences; 2: Tencent Youtu Lab; 3: School of Artificial Intelligence, University of the Chinese Academy of Sciences; 4: Peng Cheng Laboratory.
Pseudocode | No | The paper describes its methods through architectural diagrams (Figure 1) and textual explanations, but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/YifanXu74/MQ-Det.
Open Datasets | Yes | Objects365 dataset [36] is a large-scale, high-quality dataset for object detection. We use this dataset to conduct the modulated pre-training of our MQ-Det models... LVIS benchmark [13] is a challenging dataset for long-tail objects... ODinW benchmark [23] (Object Detection in the Wild) is a more challenging benchmark for evaluating model performance under real-world scenarios.
Dataset Splits | Yes | We report on MiniVal containing 5,000 images introduced in MDETR [20] as well as the full validation set v1.0. During finetuning-free evaluation, we extract 5 instances as vision queries for each category from the downstream training set without any finetuning (see the extraction sketch after the table).
Hardware Specification | Yes | We conduct modulated pre-training of our models on the Objects365 dataset [36] for only one epoch using 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using components like BERT [8] and CLIP [32] but does not provide a specific list of software dependencies with version numbers (e.g., programming language versions, library versions, or specific framework versions) necessary for replication.
Experiment Setup | Yes | We report the hyper-parameter settings of the modulated pre-training of MQ-Det in Tab. VI. Other settings are the same as those of the corresponding language-queried detectors. (Table VI: lr of GCP = 1e-5; mask rate = 40%. See the config sketch after the table.)
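
The Dataset Splits row states that 5 instances per category are extracted from the downstream training set as vision queries, with no finetuning. Below is a minimal sketch of that sampling step, assuming COCO-style annotations; the function name extract_vision_queries and the field names are illustrative and not taken from the MQ-Det repository.

```python
# Hypothetical sketch: sample k exemplar instances per category from a
# COCO-style training set to serve as vision queries (no finetuning).
import json
import random
from collections import defaultdict

def extract_vision_queries(annotation_file: str, k: int = 5, seed: int = 0):
    """Return {category_id: [annotation, ...]} with up to k instances each."""
    with open(annotation_file) as f:
        coco = json.load(f)

    by_category = defaultdict(list)
    for ann in coco["annotations"]:
        by_category[ann["category_id"]].append(ann)

    rng = random.Random(seed)
    queries = {}
    for cat_id, anns in by_category.items():
        # Sample up to k instances; categories with fewer keep all of them.
        queries[cat_id] = rng.sample(anns, k) if len(anns) > k else list(anns)
    return queries

# Usage: vision_queries = extract_vision_queries("train_annotations.json", k=5)
# Each sampled box would then be cropped from its image and encoded by the
# vision encoder to form the per-category query bank.
```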
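
For reference, the modulated pre-training settings quoted in the Experiment Setup, Open Datasets, and Hardware rows could be collected into a single config as sketched below. The key names are assumptions for illustration only and do not reflect the actual MQ-Det/GLIP config schema.

```python
# Hypothetical config sketch for the modulated pre-training stage,
# reflecting the Table VI values and training details quoted above.
modulated_pretraining_config = {
    "dataset": "Objects365",         # pre-training data (Open Datasets row)
    "epochs": 1,                     # one epoch (Hardware Specification row)
    "gpus": 8,                       # 8x NVIDIA V100
    "gcp_learning_rate": 1e-5,       # lr of GCP (Table VI)
    "vision_query_mask_rate": 0.40,  # 40% mask rate (Table VI)
    # All remaining hyper-parameters follow the corresponding
    # language-queried detector (e.g., GLIP) defaults.
}
```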