Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
Authors: Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China; 2Peng Cheng Laboratory, Shenzhen, 518055, China; 3Tongyi Lab, Alibaba Group; 4Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; 5Guangdong Province Key Laboratory of Information Security Technology, China; 6Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China |
| Pseudocode | No | The paper describes its methods in prose and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release all the code and models upon acceptance. |
| Open Datasets | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]. ... The LVIS dataset is a large vocabulary dataset (1203 classes) with long tail distribution. ... We also validate the open-vocabulary ability inherited by Frozen-DETR. ... We directly transfer the model trained on the COCO dataset to the COCO-O dataset [50] without fine-tuning |
| Dataset Splits | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... on the COCO validation set after training for 12 epochs with R50 as the detector s backbone. |
| Hardware Specification | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... For the training, we use 4 A100 GPUs with 2 images per GPU except for the ViT-L backbone due to out-of-memory (OOM). For inference, we use a V100 GPU with batch size 1 in line with the main text. |
| Software Dependencies | No | The paper references various models and methods (e.g., CLIP, DINO), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... Unless otherwise specified, we employ the ImageNet-1k [17] supervised pre-training ResNet-50 (R50) [30] as the backbone of the detector. ... All experiments are conducted on the COCO [43] dataset with R50, 900 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]. |