InstructDET: Diversifying Referring Object Detection with Generalized Instructions
Authors: Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song, Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By using our InDET dataset, we show that a conventional ROD model surpasses existing methods on both standard REC datasets and our InDET test set. InstructDET, our data-centric method with automatic data expansion by leveraging foundation models, directs a promising field that ROD can be greatly diversified to execute common object detection instructions. |
| Researcher Affiliation | Collaboration | Tongji University; SenseTime Research; The University of Hong Kong; Tencent AI Lab; Alibaba DAMO Academy |
| Pseudocode | No | The paper includes figures illustrating pipelines and architectures (e.g., Figure 2: An overview of our InstructDET, Figure 13: A detailed overview of our diversified referring object detection (DROD) model), but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | The code is available at https://github.com/jyFengGoGo/InstructDet. |
| Open Datasets | Yes | Our InDET dataset contains images from MSCOCO (Lin et al., 2014), Flickr (Plummer et al., 2015), and Objects365 (Shao et al., 2019). |
| Dataset Splits | Yes | We split the images into training, validation, and testing sets, with the corresponding instruction amounts of 3139K, 240K, and 247K, respectively. (A small arithmetic check of these proportions follows the table.) |
| Hardware Specification | Yes | The model is trained on 32 and 16 V100 GPUs for pretraining and finetuning, respectively. |
| Software Dependencies | No | The paper mentions using specific models and components like 'BERT', 'ViT-Huge', 'Q-Former', 'MiniGPT-4 weights', and 'Vicuna 13B' as the LLM. It also mentions 'AdamW' as the optimizer. However, it does not provide specific version numbers for these software dependencies (e.g., PyTorch 1.x, TensorFlow 2.x, or specific library versions for BERT/ViT implementations). |
| Experiment Setup | Yes | The transformer encoder-decoder architecture consists of a six-layer encoder and a six-layer decoder. The number of object queries N is set to 900. Our DROD model is initialized with weights pretrained on Objects365 released by UNINEXT (Yan et al., 2023). The optimizer is AdamW (Loshchilov & Hutter, 2019) with a learning rate of 2e-4, a weight decay of 0.05, and 400 warm-up steps with an initial learning rate of 4e-5. (A minimal sketch of this optimizer configuration follows the table.) |
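The split sizes in the Dataset Splits row imply a heavily training-weighted ratio. As a quick illustration, here is a minimal Python check of the proportions; the dictionary and variable names are ours, not from the paper:

```python
# Arithmetic check of the reported InDET split sizes (instruction counts in thousands).
# The split names and counts come from the paper; the code itself is illustrative.
splits = {"train": 3139, "val": 240, "test": 247}

total = sum(splits.values())  # 3626K instructions overall
for name, count in splits.items():
    print(f"{name}: {count}K instructions ({count / total:.1%} of {total}K total)")
```

This works out to roughly 87% training, 6.6% validation, and 6.8% testing.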
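The Experiment Setup row reports the optimizer hyperparameters but not the exact shape of the warm-up schedule. Below is a minimal PyTorch sketch of that configuration, assuming a linear ramp from the initial warm-up learning rate to the base rate; the `Linear` module is a placeholder, not the actual DROD model:

```python
# Minimal sketch of the reported optimizer setup: AdamW with lr 2e-4 and
# weight decay 0.05, warmed up over 400 steps from an initial lr of 4e-5.
# A linear warm-up ramp is assumed; the paper does not state the shape.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 2e-4         # learning rate after warm-up (from the paper)
WARMUP_INIT_LR = 4e-5  # learning rate at step 0 (from the paper)
WARMUP_STEPS = 400     # warm-up duration in steps (from the paper)

model = torch.nn.Linear(256, 256)  # placeholder; not the actual DROD model

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)

def warmup_lambda(step: int) -> float:
    # LambdaLR multiplies BASE_LR by this factor at each scheduler step.
    if step >= WARMUP_STEPS:
        return 1.0
    start = WARMUP_INIT_LR / BASE_LR  # 0.2, i.e. 4e-5 / 2e-4
    return start + (1.0 - start) * step / WARMUP_STEPS

scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)

# Training loop (schematic): call optimizer.step() then scheduler.step() per batch.
```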