Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
Authors: Shenghao Fu, Junkai Yan, Qize Yang, Xihan Wei, Xiaohua Xie, Wei-Shi Zheng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With such a novel paradigm, we boost the SOTA query-based detector DINO from 49.0% AP to 51.9% AP (+2.9% AP) and further to 53.8% AP (+4.8% AP) by integrating one or two foundation models respectively, on the COCO validation set after training for 12 epochs with R50 as the detector's backbone. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China; 2Peng Cheng Laboratory, Shenzhen, 518055, China; 3Tongyi Lab, Alibaba Group; 4Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; 5Guangdong Province Key Laboratory of Information Security Technology, China; 6Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China |
| Pseudocode | No | The paper describes its methods in prose and with figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release all the code and models upon acceptance. |
| Open Datasets | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]. ... The LVIS dataset is a large vocabulary dataset (1203 classes) with long tail distribution. ... We also validate the open-vocabulary ability inherited by Frozen-DETR. ... We directly transfer the model trained on the COCO dataset to the COCO-O dataset [50] without fine-tuning |
| Dataset Splits | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... on the COCO validation set after training for 12 epochs with R50 as the detector s backbone. |
| Hardware Specification | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... For the training, we use 4 A100 GPUs with 2 images per GPU except for the ViT-L backbone due to out-of-memory (OOM). For inference, we use a V100 GPU with batch size 1 in line with the main text. |
| Software Dependencies | No | The paper references various models and methods (e.g., CLIP, DINO), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All experiments are conducted on the COCO [43] dataset with 300 queries, 12 training epochs, and 4 V100 GPUs. ... Unless otherwise specified, we employ the ImageNet-1k [17] supervised pre-training ResNet-50 (R50) [30] as the backbone of the detector. ... All experiments are conducted on the COCO [43] dataset with R50, 900 queries, 12 training epochs, and 4 V100 GPUs. ... We choose DINO-det-4scale [66] as the baseline and train the model for 24 epochs without using mask annotations. Following common practices, we use repeat factor sampling and Federated Loss [72]. |