DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Authors: Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed framework demonstrates strong zero-shot detection performance, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared to the fully-supervised model with the same backbone as ours. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, 2Huawei Noah's Ark Lab, 3Shenzhen Campus of Sun Yat-sen University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We already include the data and instructions in Section 4, and the code will be released upon acceptance. |
| Open Datasets | Yes | Our model is trained with a hybrid supervision from different kinds of data, i.e., detection data, grounding data, and image-text pair data. More specifically, for detection data, we use a sampled Objects365 V2 [43] dataset (denoted as O365 in the following sections) with 0.66M training images. ... For grounding data, we use gold grounding data (denoted as GoldG) introduced by MDETR [20]. ... For image-text pair data, we perform object-level dense pseudo labeling on the YFCC100m [45] dataset with a pre-trained CLIP [36] model... |
| Dataset Splits | Yes | We evaluate our method mainly on LVIS [16], which contains 1203 categories. Following GLIP [28] and MDETR [20], we evaluate on the 5k minival subset and report the zero-shot fixed AP [9] for a fair comparison. ... Results on the LVIS full validation set can be found in the Appendix. |
| Hardware Specification | Yes | We pre-train all the models based on Swin-Transformer [33] backbones with 32 GPUs. ... With the same setting of training with 32 V100 GPUs, the total training time for GLIP-T is about 10.7K GPU hours (5× ours) due to its heavy backbone and more image-text pair training data. |
| Software Dependencies | No | MMDetection [6] code-base is used. |
| Experiment Setup | Yes | The AdamW optimizer [22] is adopted and the batch size is set to 128. The learning rate is set to 2.8×10⁻⁴ for the parameters of the visual backbone and detection head, and 2.8×10⁻⁵ for the language backbone. Unless otherwise specified, all models are trained for 12 epochs and the learning rate is decayed by a factor of 0.1 at the 8th and 11th epochs. The max token length for each input sentence is set to 48. The number of concepts N in the text input P is set to 150, and the number of region features M is determined by the feature map size and the number of pre-defined anchors. The loss weight factors α and β are both set to 1.0. (See the optimizer/scheduler sketch after the table.) |
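
The training setup quoted in the last row maps onto a standard two-group optimizer configuration. Below is a minimal sketch assuming a PyTorch-style training loop; the function name `build_optimizer_and_scheduler`, the parameter grouping, and the epoch-based LR stepping are illustrative assumptions, while the numeric values are taken from the quoted setup.

```python
# Minimal sketch of the reported optimization setup (not the authors' code).
import torch

# Reported hyperparameters.
BATCH_SIZE = 128          # global batch size
MAX_TOKEN_LENGTH = 48     # max token length per input sentence
NUM_CONCEPTS_N = 150      # number of concepts N in the text input P
LOSS_ALPHA = LOSS_BETA = 1.0

def build_optimizer_and_scheduler(visual_backbone_params, det_head_params,
                                  language_backbone_params):
    # Separate learning rates: 2.8e-4 for the visual backbone and detection
    # head, 2.8e-5 for the language backbone.
    optimizer = torch.optim.AdamW([
        {"params": list(visual_backbone_params) + list(det_head_params),
         "lr": 2.8e-4},
        {"params": list(language_backbone_params), "lr": 2.8e-5},
    ])
    # 12 epochs in total; decay the learning rate by a factor of 0.1 after the
    # 8th and 11th epochs (scheduler.step() is assumed to be called per epoch).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[8, 11], gamma=0.1)
    return optimizer, scheduler
```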