DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Authors: Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, Lei Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision)...our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. |
| Researcher Affiliation | Collaboration | Shilong Liu1,2*, Shijia Huang3, Feng Li2,4, Hao Zhang2,4, Yaoyuan Liang5, Hang Su1, Jun Zhu1, Lei Zhang2. 1 Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University. 2 International Digital Economy Academy (IDEA). 3 The Chinese University of Hong Kong. 4 The Hong Kong University of Science and Technology. 5 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found. |
| Open Source Code | No | No code release was found; the paper only states: "Code will be available at https://github.com/IDEA-Research/DQDETR." |
| Open Datasets | Yes | We use two commonly used image backbones, ResNet-50 and ResNet-101 (He et al. 2016) pre-trained on ImageNet (Deng et al. 2009)... Following MDETR (Kamath et al. 2021), we use the combined dataset of Flickr30k, COCO, and Visual Genome for our pre-training. |
| Dataset Splits | Yes | For the ANY-BOX protocol, we evaluate our pre-trained model on the validation and test splits directly. For the MERGED-BOXES protocol, we fine-tune the pre-trained model for 5 epochs. |
| Hardware Specification | Yes | The pre-training takes about 100 hours on 16 Nvidia A100 GPUs with 4 images per GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch, Hugging Face, and spaCy, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We set D = 256 and D1 = 64 in our implementations and use 100 pairs of dual queries. Our models use 6 encoder layers and 6 decoder layers. The initial learning rates for the Transformer encoder-decoder and image backbone are 1e-4 and 1e-5, respectively. For the text backbone, we use a linear decay schedule from 5e-5 to 0 with a linear warm-up in the first 1% of steps. (A hedged optimizer sketch follows the table.) |
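The learning-rate details quoted above map onto a standard multi-parameter-group PyTorch setup. Below is a minimal sketch, assuming AdamW and hypothetical module names (`model.transformer`, `model.backbone`, `model.text_encoder`); only the per-group learning rates and the text-backbone warm-up/linear-decay schedule come from the paper, and the constant schedule for the other two groups is an assumption.

```python
import torch

def build_optimizer_and_scheduler(model, total_steps):
    # Per-group learning rates from the paper; module names are hypothetical.
    param_groups = [
        {"params": model.transformer.parameters(), "lr": 1e-4},   # Transformer encoder-decoder
        {"params": model.backbone.parameters(), "lr": 1e-5},      # image backbone
        {"params": model.text_encoder.parameters(), "lr": 5e-5},  # text backbone (peak lr)
    ]
    # AdamW is an assumption; the paper does not name the optimizer here.
    optimizer = torch.optim.AdamW(param_groups)

    # Linear warm-up over the first 1% of steps, then linear decay to 0,
    # applied to the text backbone only, as described in the paper.
    warmup_steps = max(1, int(0.01 * total_steps))

    def text_lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    # One lr lambda per parameter group: constant for the first two groups
    # (an assumption), warm-up + linear decay for the text backbone.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=[lambda _: 1.0, lambda _: 1.0, text_lr_lambda],
    )
    return optimizer, scheduler
```

Driving the text backbone through a per-group lr lambda keeps the three schedules independent while sharing one optimizer; the scheduler would be stepped once per training iteration so the 1% warm-up is measured in steps rather than epochs.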