Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models

Authors: Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone."

Researcher Affiliation | Collaboration | 1. School of Computer Science and Engineering, Sun Yat-sen University, China; 2. Peng Cheng Laboratory, Shenzhen, 518055, China; 3. Tongyi Lab, Alibaba Group; 4. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; 5. Guangdong Province Key Laboratory of Information Security Technology, China; 6. Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China

Pseudocode | No | The paper describes its methods in prose and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | No | "We will release all the code and models upon acceptance."

Open Datasets | Yes | "All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]. ... The LVIS dataset is a large vocabulary dataset (1203 classes) with long tail distribution. ... We also validate the open-vocabulary ability inherited by Frozen-DETR. ... We directly transfer the model trained on the COCO dataset to the COCO-O dataset [50] without fine-tuning."

Dataset Splits | Yes | "All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... on the COCO validation set after training for 12 epochs with R50 as the detector's backbone."

Hardware Specification | Yes | "All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... For the training, we use 4 A100 GPUs with 2 images per GPU except for the ViT-L backbone due to out-of-memory (OOM). For inference, we use a V100 GPU with batch size 1 in line with the main text."

Software Dependencies | No | The paper references various models and methods (e.g., CLIP, DINO), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).

Experiment Setup | Yes | "All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... Unless otherwise specified, we employ the ImageNet-1k [17] supervised pre-training ResNet-50 (R50) [30] as the backbone of the detector. ... All experiments are conducted on the COCO [43] dataset with R50, 900 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]."
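The quoted Experiment Setup can be summarized as a small configuration sketch. This is a hypothetical summary assembled from the quotes above, not the authors' released code: the dictionary keys and structure are illustrative assumptions, while the values (queries, epochs, GPUs, backbone, baseline) come directly from the paper's reported settings.

```python
# Hypothetical config summarizing the reported Frozen-DETR training setup.
# Key names are illustrative; values are taken from the paper's quoted setup.
frozen_detr_coco_config = {
    "dataset": "COCO",
    "baseline": "DINO-det-4scale",     # [66] in the paper
    "backbone": "ResNet-50",           # ImageNet-1k supervised pre-training
    "num_queries": 300,                # 900 in some reported experiments
    "epochs": 12,                      # 24 for the LVIS experiments
    "train_gpus": 4,                   # V100 (A100 reported for some runs)
    "images_per_gpu": 2,
    "inference_batch_size": 1,         # single V100 at inference
}

# The LVIS setting additionally uses repeat factor sampling and Federated Loss.
lvis_overrides = {"epochs": 24, "sampling": "repeat_factor", "loss": "federated"}
frozen_detr_lvis_config = {**frozen_detr_coco_config, **lvis_overrides}
```

Merging the base config with per-dataset overrides keeps the shared hyperparameters in one place, mirroring how the paper varies only epochs and sampling between the COCO and LVIS runs.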