Anchor DETR: Query Design for Transformer-Based Detector

Authors: Yingming Wang, Xiangyu Zhang, Tong Yang, Jian Sun

AAAI 2022

Reproducibility assessment: each variable is listed with its result and the supporting LLM response.
Research Type: Experimental. LLM Response: "Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. For example, it achieves 44.2 AP with 19 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature for training 50 epochs."
Researcher Affiliation: Industry. LLM Response: "MEGVII Technology {wangyingming, zhangxiangyu, yangtong, sunjian}@megvii.com"
Pseudocode: No. LLM Response: "The paper describes its methods using mathematical formulations and text, but it does not include any explicit pseudocode or algorithm blocks."
Open Source Code: Yes. LLM Response: "Code is available at https://github.com/megvii-research/AnchorDETR."
Open Datasets: Yes. LLM Response: "We conduct the experiments on the MS COCO (Lin et al. 2014) benchmark. All models are trained on the train2017 split and evaluated on the val2017."
Dataset Splits: Yes. LLM Response: "All models are trained on the train2017 split and evaluated on the val2017." (A dataset-loading sketch follows the table below.)
Hardware Specification: No. LLM Response: "The paper states 'All models are trained on 8 GPUs' but does not specify the exact GPU models, CPU, or any other hardware specifications used for the experiments."
Software Dependencies: No. LLM Response: "The paper does not provide specific version numbers for any software dependencies, such as deep learning frameworks (e.g., PyTorch, TensorFlow) or programming languages (e.g., Python)."
Experiment Setup: Yes. LLM Response: "We train our model on the training set for 50 epochs with the AdamW optimizer (Loshchilov and Hutter 2019), setting the initial learning rate to 10^-5 for the backbone and 10^-4 for the others. The learning rate will be decayed by a factor of 0.1 at the 40th epoch. We set the weight decay to 10^-4 and the dropout rate to 0 (i.e. remove dropout). The number of attention heads is 8, the attention feature channel is 256, and the hidden dimension of the feed-forward network is 1024. We set the number of anchor points to 300 and the number of patterns to 3 by default. We use a set of learned points as anchor points by default. The number of encoder layers and decoder layers is 6, like DETR. We use the focal loss (Lin et al. 2017) as the classification loss, following Deformable DETR." (A training-setup sketch follows the table below.)
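
As referenced in the Dataset Splits row, train2017/val2017 is the standard COCO 2017 detection protocol. A minimal sketch of loading the two splits with torchvision's CocoDetection follows; the local dataset root path is hypothetical, and pycocotools must be installed:

```python
# Minimal sketch of the standard COCO 2017 detection splits cited above.
# The "data/coco" root is a hypothetical local path.
from torchvision.datasets import CocoDetection

root = "data/coco"
train_set = CocoDetection(
    root=f"{root}/train2017",
    annFile=f"{root}/annotations/instances_train2017.json",
)
val_set = CocoDetection(
    root=f"{root}/val2017",
    annFile=f"{root}/annotations/instances_val2017.json",
)
print(len(train_set), len(val_set))  # ~118k training images, 5k validation images
```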
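
As referenced in the Experiment Setup row, the quoted optimization recipe can be written as a short PyTorch sketch. Only the hyperparameters come from the quote; identifying backbone parameters by a "backbone" name prefix is an assumption about the model's module naming:

```python
# Sketch of the quoted schedule: AdamW with lr 1e-5 for the backbone and
# 1e-4 for everything else, weight decay 1e-4, decayed by 0.1 at the 40th
# of 50 epochs. The "backbone" name prefix is an assumed convention.
import torch

def build_optimizer(model: torch.nn.Module):
    backbone = [p for n, p in model.named_parameters()
                if n.startswith("backbone") and p.requires_grad]
    others = [p for n, p in model.named_parameters()
              if not n.startswith("backbone") and p.requires_grad]
    optimizer = torch.optim.AdamW(
        [{"params": backbone, "lr": 1e-5},
         {"params": others, "lr": 1e-4}],
        weight_decay=1e-4,
    )
    # Multiply both learning rates by 0.1 once, at the 40th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[40], gamma=0.1)
    return optimizer, scheduler
```

The quoted setup also uses the focal loss of Lin et al. (2017) as the classification loss, which in its standard form is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).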