DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Authors: Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, Lei Zhang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision)... our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone.
Researcher Affiliation | Collaboration | Shilong Liu1,2*, Shijia Huang3, Feng Li2,4, Hao Zhang2,4, Yaoyuan Liang5, Hang Su1, Jun Zhu1, Lei Zhang2. 1 Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University. 2 International Digital Economy Academy (IDEA). 3 The Chinese University of Hong Kong. 4 The Hong Kong University of Science and Technology. 5 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University.
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found.
Open Source Code | No | Code will be available at https://github.com/IDEA-Research/DQDETR.
Open Datasets | Yes | We use two commonly used image backbones, ResNet-50 and ResNet-101 (He et al. 2016) pre-trained on ImageNet (Deng et al. 2009)... Following MDETR (Kamath et al. 2021), we use the combined dataset of Flickr30k, COCO, and Visual Genome for our pre-training.
Dataset Splits | Yes | For the ANY-BOX protocol, we evaluate our pre-trained model on the validation and test splits directly. For the MERGED-BOXES protocol, we fine-tune the pre-trained model for 5 epochs.
Hardware Specification | Yes | The pre-training takes about 100 hours on 16 Nvidia A100 GPUs with 4 images per GPU.
Software Dependencies | No | The paper mentions software like PyTorch, Hugging Face, and spaCy, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We set D = 256 and D1 = 64 in our implementations and use 100 pairs of dual queries. Our models use 6 encoder layers and 6 decoder layers. The initial learning rates for the Transformer encoder-decoder and image backbone are 1e-4 and 1e-5, respectively. For the text backbone, we use a linear decay schedule from 5e-5 to 0 with a linear warm-up in the first 1% steps.
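
The rows above leave a few implementation details implicit; the sketches that follow illustrate them under stated assumptions. First, the Open Datasets row describes an MDETR-style mixture of Flickr30k, COCO, and Visual Genome. One way to approximate such a combined corpus is to concatenate per-source datasets; the GroundingDataset class and toy annotations below are placeholders, not the paper's data pipeline.

```python
# Minimal sketch of an MDETR-style combined pre-training corpus; the
# GroundingDataset class and the toy annotations are placeholders, not the
# paper's data pipeline.
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class GroundingDataset(Dataset):
    """Placeholder for one source (Flickr30k, COCO, or Visual Genome):
    each item pairs an image reference with a caption and phrase boxes."""
    def __init__(self, annotations):
        self.annotations = annotations

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        return self.annotations[idx]

# Toy annotations stand in for the real Flickr30k / COCO / Visual Genome files.
flickr = GroundingDataset([{"image": "flickr_0001.jpg", "caption": "a dog on grass", "boxes": [[0, 0, 50, 50]]}])
coco = GroundingDataset([{"image": "coco_0001.jpg", "caption": "two cats on a couch", "boxes": [[10, 10, 80, 90]]}])
vg = GroundingDataset([{"image": "vg_0001.jpg", "caption": "a red car", "boxes": [[5, 20, 60, 70]]}])

combined = ConcatDataset([flickr, coco, vg])
loader = DataLoader(combined, batch_size=4, shuffle=True, collate_fn=lambda batch: batch)
```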
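
Next, the Hardware Specification row (16 Nvidia A100 GPUs, 4 images per GPU) implies an effective batch size of 64 images per iteration. The arithmetic and a possible launch command are sketched below; the 2-node x 8-GPU layout and the main.py entry point are assumptions, not details from the paper.

```python
# Effective batch size implied by the reported hardware (16 A100 GPUs, 4 images each).
num_gpus = 16
images_per_gpu = 4
effective_batch_size = num_gpus * images_per_gpu
print(f"effective batch size: {effective_batch_size}")  # 64 images per iteration

# A typical multi-node launch for such a run might look like the line below;
# the 2-node x 8-GPU layout and the `main.py` entry point are assumptions:
#   torchrun --nnodes=2 --nproc_per_node=8 main.py --batch_size 4
```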
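
Because the Software Dependencies row notes that no versions are pinned, a reproducer would have to record their own environment. A minimal way to do that is sketched below, assuming the usual PyPI package names (torch, transformers, spacy) for the libraries the paper mentions.

```python
# Record the otherwise-unreported dependency versions of the local environment.
# Mapping PyTorch / Hugging Face / spaCy to the package names torch, transformers,
# and spacy is an assumption; the paper itself pins nothing.
from importlib.metadata import version, PackageNotFoundError

for package in ("torch", "transformers", "spacy"):
    try:
        print(f"{package}=={version(package)}")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```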
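
Finally, the Experiment Setup row translates naturally into optimizer parameter groups with a warm-up/decay schedule on the text backbone. The sketch below is an illustration, not the released code: the optimizer choice (AdamW), weight decay, total step count, and module names are assumptions, while the learning rates and the warm-up/decay shape follow the quoted setup.

```python
# Illustrative sketch of the reported optimization setup, not the released code.
# AdamW, the weight decay value, the step count, and the placeholder module names
# are assumptions; the per-group learning rates and the text-backbone schedule
# follow the quoted experiment setup.
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

class TinyDQDETR(nn.Module):
    """Placeholder exposing the three module groups given different learning rates."""
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(256, 256)   # stands in for the 6+6-layer encoder-decoder
        self.backbone = nn.Linear(256, 256)      # stands in for ResNet-50/101
        self.text_encoder = nn.Linear(256, 256)  # stands in for the text backbone

model = TinyDQDETR()
total_steps = 100_000                    # placeholder; set to the real schedule length
warmup_steps = int(0.01 * total_steps)   # linear warm-up over the first 1% of steps

optimizer = AdamW(
    [
        {"params": model.transformer.parameters(), "lr": 1e-4},   # encoder-decoder
        {"params": model.backbone.parameters(), "lr": 1e-5},      # image backbone
        {"params": model.text_encoder.parameters(), "lr": 5e-5},  # text backbone
    ],
    weight_decay=1e-4,  # assumed; not reported in the quoted text
)

def text_backbone_factor(step):
    # Warm up linearly to 5e-5, then decay linearly to 0 by the last step.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(
    optimizer,
    lr_lambda=[lambda step: 1.0,        # transformer stays at 1e-4
               lambda step: 1.0,        # image backbone stays at 1e-5
               text_backbone_factor],   # text backbone: warm-up then linear decay to 0
)
```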