Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding
Authors: Shilong Liu, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, Lei Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision)... our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. |
| Researcher Affiliation | Collaboration | Shilong Liu1,2*, Shijia Huang3, Feng Li2,4, Hao Zhang2,4, Yaoyuan Liang5, Hang Su1, Jun Zhu1, Lei Zhang2. 1 Dept. of CST, BNRist Center, Inst. for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University. 2 International Digital Economy Academy (IDEA). 3 The Chinese University of Hong Kong. 4 The Hong Kong University of Science and Technology. 5 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University. |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found. |
| Open Source Code | No | Code will be available at https://github.com/IDEA-Research/DQDETR. |
| Open Datasets | Yes | We use two commonly used image backbones, ResNet-50 and ResNet-101 (He et al. 2016) pre-trained on ImageNet (Deng et al. 2009)... Following MDETR (Kamath et al. 2021), we use the combined dataset of Flickr30k, COCO, and Visual Genome for our pre-training. |
| Dataset Splits | Yes | For the ANY-BOX protocol, we evaluate our pre-trained model on the validation and test splits directly. For the MERGED-BOXES protocol, we fine-tune the pre-trained model for 5 epochs. |
| Hardware Specification | Yes | The pre-training takes about 100 hours on 16 Nvidia A100 GPUs with 4 images per GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch, Hugging Face, and spaCy, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We set D = 256 and D1 = 64 in our implementations and use 100 pairs of dual queries. Our models use 6 encoder layers and 6 decoder layers. The initial learning rates for the Transformer encoder-decoder and image backbone are 1e-4 and 1e-5, respectively. For the text backbone, we use a linear decay schedule from 5e-5 to 0 with a linear warm-up in the first 1% steps. |
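
The Experiment Setup and Hardware Specification rows quote a few numeric settings that are easier to check when written out. The following is a minimal sketch, not the authors' implementation. Only the values (5e-5, the 1% warm-up, 1e-4, 1e-5, 16 GPUs, 4 images per GPU) come from the quoted text. The function name, the dictionary keys, the `total_steps` argument, and the choice to start the warm-up at 0 are assumptions made for illustration.

```python
# Hypothetical sketch of the quoted training settings; names are illustrative.

# Initial learning rates quoted in the Experiment Setup row.
BASE_LRS = {
    "transformer": 1e-4,     # Transformer encoder-decoder
    "image_backbone": 1e-5,  # ResNet image backbone
    "text_backbone": 5e-5,   # peak value of the text-backbone schedule
}

# Values quoted in the Hardware Specification row.
NUM_GPUS = 16
IMAGES_PER_GPU = 4
GLOBAL_BATCH_SIZE = NUM_GPUS * IMAGES_PER_GPU  # derived here, not quoted


def text_backbone_lr(step, total_steps,
                     peak_lr=BASE_LRS["text_backbone"], warmup_frac=0.01):
    """Linear warm-up to peak_lr, then linear decay to 0.

    Assumes the warm-up starts at 0; the quoted text does not say so.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = max(1, round(warmup_frac * total_steps))
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    if step < warmup_steps:
        # Warm-up phase: rate grows linearly with the step count.
        return peak_lr * step / warmup_steps
    # Decay phase: rate falls linearly and reaches 0 at total_steps.
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 10_000  # assumed step count, chosen only for this demo
    for s in (0, 50, 100, 5_050, 10_000):
        print(s, text_backbone_lr(s, total))
```

In a PyTorch training loop, a function like this could be wrapped in `torch.optim.lr_scheduler.LambdaLR` after dividing by the peak rate, since that scheduler expects a multiplicative factor rather than an absolute learning rate.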