InstructDET: Diversifying Referring Object Detection with Generalized Instructions

Authors: Ronghao Dang, Jiangyan Feng, Haodong Zhang, Chongjian Ge, Lin Song, Lijun Gong, Chengju Liu, Qijun Chen, Feng Zhu, Rui Zhao, Yibing Song

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By using our InDET dataset, we show that a conventional ROD model surpasses existing methods on both standard REC datasets and our InDET test set. InstructDET, our data-centric method with automatic data expansion by leveraging foundation models, points to a promising direction in which ROD can be greatly diversified to execute common object detection instructions.
Researcher Affiliation | Collaboration | (1) Tongji University, (2) SenseTime Research, (3) The University of Hong Kong, (4) Tencent AI Lab, (5) Alibaba DAMO Academy
Pseudocode | No | The paper includes figures illustrating pipelines and architectures (e.g., Figure 2, an overview of InstructDET, and Figure 13, a detailed overview of the diversified referring object detection (DROD) model), but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. An illustrative sketch of the data-expansion loop is given after the table.
Open Source Code | Yes | The code is available at https://github.com/jyFengGoGo/InstructDet.
Open Datasets | Yes | Our InDET dataset contains images from MSCOCO (Lin et al., 2014), Flickr (Plummer et al., 2015), and Objects365 (Shao et al., 2019).
Dataset Splits | Yes | We split the images into training, validation, and testing sets, with corresponding instruction counts of 3139K, 240K, and 247K, respectively. A quick proportion check appears after the table.
Hardware Specification | Yes | The model is trained on 32 and 16 V100 GPUs for pretraining and finetuning, respectively.
Software Dependencies | No | The paper mentions specific models and components such as BERT, ViT-Huge, Q-Former, MiniGPT-4 weights, and Vicuna-13B as the LLM, and names AdamW as the optimizer. However, it does not provide version numbers for these software dependencies (e.g., PyTorch 1.x, TensorFlow 2.x, or specific library versions for the BERT/ViT implementations).
Experiment Setup | Yes | The transformer encoder-decoder architecture consists of a six-layer encoder and a six-layer decoder. The number of object queries N is set to 900. Our DROD model is initialized with weights pretrained on Objects365 released by UNINEXT (Yan et al., 2023). The optimizer is AdamW (Loshchilov & Hutter, 2019) with a learning rate of 2e-4 and a weight decay of 0.05; warm-up lasts 400 steps with an initial learning rate of 4e-5. A hedged sketch of this optimizer setup follows the table.
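
Since the paper offers no pseudocode (see the Pseudocode row above), the following is an illustrative sketch of ours, not the authors' code, of the data-expansion loop the InstructDET pipeline describes: foundation models generate diverse referring instructions for annotated boxes, and the instructions are filtered before being paired with the boxes as training data. All function and object names (`vlm.describe`, `llm.paraphrase`, `filter_model.matches`) are hypothetical.

```python
# Illustrative pseudocode of an InstructDET-style data expansion.
# Every callable here is a hypothetical stand-in, not the paper's API.
def expand_dataset(images_with_boxes, vlm, llm, filter_model):
    dataset = []
    for image, boxes in images_with_boxes:
        for box in boxes:
            captions = vlm.describe(image, box)        # ground the object visually
            instructions = llm.paraphrase(captions)    # diversify the wording
            # Keep only instructions the filter model judges to match the box.
            kept = [i for i in instructions if filter_model.matches(image, box, i)]
            dataset.extend((image, box, instr) for instr in kept)
    return dataset
```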
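A quick sanity check of the reported InDET split sizes (3139K / 240K / 247K instructions), assuming the counts are exact:

```python
# Proportions implied by the reported split sizes.
splits = {"train": 3_139_000, "val": 240_000, "test": 247_000}
total = sum(splits.values())
for name, n in splits.items():
    print(f"{name}: {n:,} instructions ({n / total:.1%})")
# -> train ~86.6%, val ~6.6%, test ~6.8%
```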
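A minimal PyTorch sketch of the reported optimizer settings: AdamW with learning rate 2e-4, weight decay 0.05, and a 400-step warm-up starting at 4e-5. The linear warm-up shape and the stand-in `torch.nn.Transformer` module are assumptions; the paper only states the step count and the two learning rates.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in for the six-layer encoder/decoder detector described above.
model = torch.nn.Transformer(num_encoder_layers=6, num_decoder_layers=6)

base_lr, warmup_lr, warmup_steps = 2e-4, 4e-5, 400
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

def warmup_factor(step: int) -> float:
    # Scale linearly from warmup_lr to base_lr over the first 400 steps,
    # then hold at base_lr (LambdaLR multiplies base_lr by this factor).
    if step < warmup_steps:
        return (warmup_lr + (base_lr - warmup_lr) * step / warmup_steps) / base_lr
    return 1.0

scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)
```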