Multi-modal Queried Object Detection in the Wild
Authors: Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xiaoshan Yang, Ke Li, Changsheng Xu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by an average of +6.3% AP on 13 few-shot downstream tasks, with merely 3% additional modulating time on top of GLIP. |
| Researcher Affiliation | Collaboration | Yifan Xu (1,3), Mengdan Zhang (2), Chaoyou Fu (2), Peixian Chen (2), Xiaoshan Yang (1,3,4), Ke Li (2), Changsheng Xu (1,3,4). Affiliations: 1 MAIS, Institute of Automation, Chinese Academy of Sciences; 2 Tencent Youtu Lab; 3 School of Artificial Intelligence, University of the Chinese Academy of Sciences; 4 Peng Cheng Laboratory |
| Pseudocode | No | The paper describes its methods through architectural diagrams (Figure 1) and textual explanations, but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/YifanXu74/MQ-Det. |
| Open Datasets | Yes | Objects365 dataset [36] is a large-scale, high-quality dataset for object detection. We use this dataset to conduct the modulated pre-training of our MQ-Det models... LVIS benchmark [13] is a challenging dataset for long-tail objects... ODinW benchmark [23] (Object Detection in the Wild) is a more challenging benchmark for evaluating model performance under real-world scenarios. |
| Dataset Splits | Yes | We report on MiniVal containing 5,000 images introduced in MDETR [20], as well as the full validation set v1.0. During finetuning-free evaluation, we extract 5 instances as vision queries for each category from the downstream training set without any finetuning. (A query-sampling sketch follows the table.) |
| Hardware Specification | Yes | We conduct modulated pre-training of our models on the Objects365 dataset [36] for only one epoch using 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using components like BERT [8] and CLIP [32] but does not provide a specific list of software dependencies with version numbers (e.g., programming language versions, library versions, or specific framework versions) necessary for replication. |
| Experiment Setup | Yes | We report the hyper-parameter settings of the modulated pre-training of MQ-Det in Tab. VI. Other settings are the same as those of the corresponding language-queried detectors. (Table VI: lr of GCP = 1e-5; mask rate = 40%.) A config sketch restating these values follows the table. |
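
The Dataset Splits row quotes the paper's finetuning-free evaluation protocol: 5 instances per category are drawn from the downstream training set as vision queries. Below is a minimal sketch of that sampling step, assuming a flat annotation list with `category_id`, `image_path`, and `bbox` fields; the helper name and annotation layout are illustrative assumptions, not MQ-Det's released code.

```python
# Hypothetical sketch: sample k vision-query instances per category from a
# downstream training set (the paper's finetuning-free evaluation uses k=5).
# The annotation format ('category_id', 'image_path', 'bbox') is an assumption.
import random
from collections import defaultdict

def sample_vision_queries(annotations, k=5, seed=0):
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for ann in annotations:
        by_category[ann["category_id"]].append(ann)
    # Up to k instance crops per category serve as that category's vision queries.
    return {
        cat_id: [(a["image_path"], a["bbox"]) for a in rng.sample(anns, min(k, len(anns)))]
        for cat_id, anns in by_category.items()
    }
```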
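The Experiment Setup row reports only two modulated pre-training hyper-parameters from the paper's Table VI. The sketch below restates them alongside the setup facts from the Hardware row; the dict layout and field names are assumptions for illustration, not the repo's actual config schema.

```python
# Sketch of the reported modulated pre-training settings. Only gcp_lr and
# mask_rate come from Table VI; dataset/epochs/gpus restate the Hardware row.
# Field names are illustrative assumptions, not MQ-Det's config keys.
pretrain_config = {
    "dataset": "Objects365",   # modulated pre-training corpus [36]
    "epochs": 1,               # trained for only one epoch
    "gpus": "8x NVIDIA V100",
    "gcp_lr": 1e-5,            # learning rate of the GCP module
    "mask_rate": 0.40,         # 40% mask rate
}
```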