DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection

Authors: Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed framework demonstrates strong zero-shot detection performances, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared to the fully-supervised model with the same backbone as ours.
Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, 2Huawei Noah's Ark Lab, 3Shenzhen Campus of Sun Yat-Sen University
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | We already include the data and instructions in Section 4, and the code will be released upon acceptance.
Open Datasets | Yes | Our model is trained with a hybrid supervision from different kinds of data, i.e., detection data, grounding data, and image-text pair data. More specifically, for detection data, we use a sampled Objects365 V2 [43] dataset (denoted as O365 in the following sections) with 0.66M training images. ... For grounding data, we use gold grounding data (denoted as GoldG) introduced by MDETR [20]. ... For image-text pair data, we perform object-level dense pseudo labeling on the YFCC100m [45] dataset with a pre-trained CLIP [36] model...
Dataset Splits | Yes | We evaluate our method mainly on LVIS [16], which contains 1203 categories. Following GLIP [28] and MDETR [20], we evaluate on the 5k minival subset and report the zero-shot fixed AP [9] for a fair comparison. ... Results on the LVIS full validation dataset can be found in the Appendix.
Hardware Specification | Yes | We pre-train all the models based on Swin-Transformer [33] backbones with 32 GPUs. ... With the same setting of training with 32 V100 GPUs, the total training time for GLIP-T is about 10.7K GPU hours (5x ours) due to its heavy backbone and more image-text pair training data.
Software Dependencies | No | The MMDetection [6] code-base is used.
Experiment Setup | Yes | The AdamW optimizer [22] is adopted and the batch size is set to 128. The learning rate is set to 2.8×10^-4 for the parameters of the visual backbone and detection head, and 2.8×10^-5 for the language backbone. Unless otherwise specified, all models are trained for 12 epochs and the learning rate is decayed by a factor of 0.1 at the 8th and 11th epochs. The max token length for each input sentence is set to 48. The number of concepts N in the text input P is set to 150 and the number of region features M is determined by the feature map size and the number of pre-defined anchors. The loss weight factors α and β are both set to 1.0.
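For readers trying to reproduce the training schedule described in the Experiment Setup row above, the following is a minimal sketch, not the authors' released code. It assumes a standard PyTorch training loop; the attribute names `visual_backbone`, `detection_head`, and `language_backbone` are hypothetical placeholders for the corresponding DetCLIP modules. Only the hyperparameters reported in the paper (AdamW, per-module learning rates, 12-epoch schedule with 0.1 decay at epochs 8 and 11) are taken from the source.

```python
import torch


def build_optimizer_and_scheduler(model):
    """Sketch of the reported DetCLIP optimization setup (assumptions noted above)."""
    # Two parameter groups, as described in the paper:
    # 2.8e-4 for the visual backbone and detection head, 2.8e-5 for the language backbone.
    param_groups = [
        {
            "params": list(model.visual_backbone.parameters())      # hypothetical attribute
            + list(model.detection_head.parameters()),              # hypothetical attribute
            "lr": 2.8e-4,
        },
        {
            "params": model.language_backbone.parameters(),         # hypothetical attribute
            "lr": 2.8e-5,
        },
    ]
    optimizer = torch.optim.AdamW(param_groups)

    # 12-epoch schedule: decay the learning rate by a factor of 0.1
    # at the end of the 8th and 11th epochs (scheduler stepped once per epoch).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[8, 11], gamma=0.1
    )
    return optimizer, scheduler
```

Since the paper reports using the MMDetection code-base, the released implementation would most likely express the same settings through a config file rather than an explicit loop; the excerpt does not include those config details, so the sketch above stays at the plain-PyTorch level.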