D3ETR: Decoder Distillation for Detection Transformer

Authors: Xiaokang Chen, Jiahui Chen, Yan Liu, Jiaxiang Tang, Gang Zeng

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform the experiments on the COCO 2017 [Lin et al., 2014b] detection dataset, which contains about 118K training (train) images and 5K validation (val) images. ... Our D3ETR obtains consistent gains over different backbones. ... The results are presented in Table 1. All the student detectors obtain significant mAP improvements with the knowledge transferred from teacher detectors. ... In this section, we first compare the proposed decoder distillation method to other CNN-based distillation methods in object detection. Subsequently, we conduct ablation studies to verify each component in our decoder distillation strategies.
Researcher Affiliation | Academia | Xiaokang Chen1, Jiahui Chen2, Yan Liu3, Jiaxiang Tang1 and Gang Zeng1; 1National Key Laboratory of General Artificial Intelligence, School of IST, Peking University; 2Beihang University; 3The Chinese University of Hong Kong
Pseudocode | No | The paper describes the proposed methods (MixMatcher, D3ETR) in detail but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The code will be released.
Open Datasets | Yes | We perform the experiments on the COCO 2017 [Lin et al., 2014b] detection dataset, which contains about 118K training (train) images and 5K validation (val) images.
Dataset Splits | Yes | We perform the experiments on the COCO 2017 [Lin et al., 2014b] detection dataset, which contains about 118K training (train) images and 5K validation (val) images.
Hardware Specification | No | The paper mentions using backbones like 'ResNet-101-C5' but does not specify any hardware components (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | We follow the training setting of DETR [Carion et al., 2020] and Conditional DETR [Meng et al., 2021] that use ImageNet pre-trained backbone from TORCHVISION with Batch Normalisation (BN) layers fixed. The transformer parameters are initialized using the Xavier initialization scheme [Glorot and Bengio, 2010]. We train our models for 12/50 epochs utilizing the AdamW [Loshchilov and Hutter, 2017] optimizer. ... The paper mentions software like TORCHVISION and the AdamW optimizer, but does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | We train our models for 12/50 epochs utilizing the AdamW [Loshchilov and Hutter, 2017] optimizer. The learning rate is reduced by a factor of 10 after 11/40 epochs, respectively. ... The data augmentation scheme is identical to DETR [Carion et al., 2020]: the input image is resized such that the short side is at least 480 pixels and at most 800 pixels and the long side is at most 1333 pixels. The training image is then randomly cropped with a probability of 0.5 to a random rectangular patch. ... µcls = 20 is the tradeoff coefficient. ℓbox is a combination of ℓ1 loss and GIoU loss [Rezatofighi et al., 2019], with loss weights of 10 and 2, respectively. ... λsa is the loss weight and set to 10,000 as default. ... λca defaults to 10,000.
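The schedule and loss weights quoted above can be made concrete in a short, framework-free sketch. This is not the authors' released code (none is available); the function names and the `base_lr` value are illustrative assumptions, while the milestone epochs (11/40), the box-loss weights (10 for ℓ1, 2 for GIoU), and the distillation weights (λsa = λca = 10,000) are taken directly from the paper.

```python
# Hedged sketch of the reported training schedule and loss weighting.
# All function names and base_lr are illustrative; the numeric constants
# (drop epochs, loss weights) come from the quoted experiment setup.

def learning_rate(epoch, base_lr=1e-4, drop_epoch=11):
    """LR reduced by a factor of 10 after `drop_epoch`.

    The paper uses drop_epoch=11 for the 12-epoch schedule and
    drop_epoch=40 for the 50-epoch schedule.
    """
    return base_lr if epoch < drop_epoch else base_lr / 10.0

def box_loss(l1_term, giou_term, w_l1=10.0, w_giou=2.0):
    """DETR-style box loss: weighted sum of L1 and GIoU terms (10 and 2)."""
    return w_l1 * l1_term + w_giou * giou_term

def distillation_loss(sa_term, ca_term, lambda_sa=10_000.0, lambda_ca=10_000.0):
    """Decoder-distillation terms weighted by lambda_sa / lambda_ca,
    both defaulting to 10,000 as stated in the paper."""
    return lambda_sa * sa_term + lambda_ca * ca_term
```

The large λ values compensate for the small magnitude of the attention-map discrepancy terms, so the distillation signal is on a comparable scale to the detection losses.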