DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

Authors: Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, Lei Zhang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision)... our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone.
Researcher Affiliation | Collaboration | Shilong Liu1,2*, Shijia Huang3, Feng Li2,4, Hao Zhang2,4, Yaoyuan Liang5, Hang Su1, Jun Zhu1, Lei Zhang2. 1 Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University. 2 International Digital Economy Academy (IDEA). 3 The Chinese University of Hong Kong. 4 The Hong Kong University of Science and Technology. 5 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University.
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found.
Open Source Code | No | Code will be available at https://github.com/IDEA-Research/DQDETR.
Open Datasets | Yes | We use two commonly used image backbones, ResNet-50 and ResNet-101 (He et al. 2016) pre-trained on ImageNet (Deng et al. 2009)... Following MDETR (Kamath et al. 2021), we use the combined dataset of Flickr30k, COCO, and Visual Genome for our pre-training.
Dataset Splits | Yes | For the ANY-BOX protocol, we evaluate our pre-trained model on the validation and test splits directly. For the MERGED-BOXES protocol, we fine-tune the pre-trained model for 5 epochs.
Hardware Specification | Yes | The pre-training takes about 100 hours on 16 Nvidia A100 GPUs with 4 images per GPU.
Software Dependencies | No | The paper mentions software like PyTorch, Hugging Face, and spaCy, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We set D = 256 and D1 = 64 in our implementations and use 100 pairs of dual queries. Our models use 6 encoder layers and 6 decoder layers. The initial learning rates for the Transformer encoder-decoder and image backbone are 1e-4 and 1e-5, respectively. For the text backbone, we use a linear decay schedule from 5e-5 to 0 with a linear warm-up in the first 1% steps.
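
The rows above leave a few implementation details implicit; the sketches that follow illustrate them under stated assumptions. First, the Open Datasets row describes an MDETR-style mixture of Flickr30k, COCO, and Visual Genome. One way to approximate such a combined corpus is to concatenate per-source datasets; the GroundingDataset class and toy annotations below are placeholders, not the paper's data pipeline.

```python
# Minimal sketch of an MDETR-style combined pre-training corpus; the
# GroundingDataset class and the toy annotations are placeholders, not the
# paper's data pipeline.
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class GroundingDataset(Dataset):
    """Placeholder for one source (Flickr30k, COCO, or Visual Genome):
    each item pairs an image reference with a caption and phrase boxes."""
    def __init__(self, annotations):
        self.annotations = annotations

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        return self.annotations[idx]

# Toy annotations stand in for the real Flickr30k / COCO / Visual Genome files.
flickr = GroundingDataset([{"image": "flickr_0001.jpg", "caption": "a dog on grass", "boxes": [[0, 0, 50, 50]]}])
coco = GroundingDataset([{"image": "coco_0001.jpg", "caption": "two cats on a couch", "boxes": [[10, 10, 80, 90]]}])
vg = GroundingDataset([{"image": "vg_0001.jpg", "caption": "a red car", "boxes": [[5, 20, 60, 70]]}])

combined = ConcatDataset([flickr, coco, vg])
loader = DataLoader(combined, batch_size=4, shuffle=True, collate_fn=lambda batch: batch)
```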
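
Next, the Hardware Specification row (16 Nvidia A100 GPUs, 4 images per GPU) implies an effective batch size of 64 images per iteration. The arithmetic and a possible launch command are sketched below; the 2-node x 8-GPU layout and the main.py entry point are assumptions, not details from the paper.

```python
# Effective batch size implied by the reported hardware (16 A100 GPUs, 4 images each).
num_gpus = 16
images_per_gpu = 4
effective_batch_size = num_gpus * images_per_gpu
print(f"effective batch size: {effective_batch_size}")  # 64 images per iteration

# A typical multi-node launch for such a run might look like the line below;
# the 2-node x 8-GPU layout and the `main.py` entry point are assumptions:
#   torchrun --nnodes=2 --nproc_per_node=8 main.py --batch_size 4
```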
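
Because the Software Dependencies row notes that no versions are pinned, a reproducer would have to record their own environment. A minimal way to do that is sketched below, assuming the usual PyPI package names (torch, transformers, spacy) for the libraries the paper mentions.

```python
# Record the otherwise-unreported dependency versions of the local environment.
# Mapping PyTorch / Hugging Face / spaCy to the package names torch, transformers,
# and spacy is an assumption; the paper itself pins nothing.
from importlib.metadata import version, PackageNotFoundError

for package in ("torch", "transformers", "spacy"):
    try:
        print(f"{package}=={version(package)}")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```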
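
Finally, the Experiment Setup row translates naturally into optimizer parameter groups with a warm-up/decay schedule on the text backbone. The sketch below is an illustration, not the released code: the optimizer choice (AdamW), weight decay, total step count, and module names are assumptions, while the learning rates and the warm-up/decay shape follow the quoted setup.

```python
# Illustrative sketch of the reported optimization setup, not the released code.
# AdamW, the weight decay value, the step count, and the placeholder module names
# are assumptions; the per-group learning rates and the text-backbone schedule
# follow the quoted experiment setup.
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

class TinyDQDETR(nn.Module):
    """Placeholder exposing the three module groups given different learning rates."""
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(256, 256)   # stands in for the 6+6-layer encoder-decoder
        self.backbone = nn.Linear(256, 256)      # stands in for ResNet-50/101
        self.text_encoder = nn.Linear(256, 256)  # stands in for the text backbone

model = TinyDQDETR()
total_steps = 100_000                    # placeholder; set to the real schedule length
warmup_steps = int(0.01 * total_steps)   # linear warm-up over the first 1% of steps

optimizer = AdamW(
    [
        {"params": model.transformer.parameters(), "lr": 1e-4},   # encoder-decoder
        {"params": model.backbone.parameters(), "lr": 1e-5},      # image backbone
        {"params": model.text_encoder.parameters(), "lr": 5e-5},  # text backbone
    ],
    weight_decay=1e-4,  # assumed; not reported in the quoted text
)

def text_backbone_factor(step):
    # Warm up linearly to 5e-5, then decay linearly to 0 by the last step.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(
    optimizer,
    lr_lambda=[lambda step: 1.0,        # transformer stays at 1e-4
               lambda step: 1.0,        # image backbone stays at 1e-5
               text_backbone_factor],   # text backbone: warm-up then linear decay to 0
)
```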