Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction
Authors: Renjie Pi, Lewei Yao, Jianhua Han, Xiaodan Liang, Wei Zhang, Hang Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on IOD-Bench demonstrate that our model consistently outperforms baseline methods that directly combine LLMs with detection models. |
| Researcher Affiliation | Collaboration | Renjie Pi¹, Lewei Yao¹, Jianhua Han², Xiaodan Liang³, Wei Zhang², Hang Xu²; ¹Hong Kong University of Science and Technology, ²Huawei Noah's Ark Lab, ³Sun Yat-sen University and MBZUAI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses leveraging 'open-source large-scale vision-language models' but does not provide any explicit statement or link for the source code of their proposed method, Ins-DetCLIP. |
| Open Datasets | Yes | We create an instruction-guided detection dataset termed IOD-Bench and develop corresponding evaluation metrics. Our approach involves leveraging the class names provided by the Objects365 dataset (Shao et al., 2019) and utilizing a powerful LLM (OpenAI, 2022) to generate instructions related to these classes. |
| Dataset Splits | Yes | We allocated 80% of these instructions for training, while the remainder is reserved for evaluation. The training dataset is constructed using the in-domain instructions and training images from Objects365, while the validation set is constructed using 20,000 images from the validation split of Objects365, paired with both in-domain and out-of-domain instructions. (A split sketch follows the table.) |
| Hardware Specification | Yes | We employ a cosine decay learning rate, starting from 2.8e-4, and conduct the training over 12 epochs using 32 GPUs. In the second stage, it takes only around 24 hours on 16 V100 GPUs for instruction tuning. |
| Software Dependencies | No | The paper mentions software components like 'Flan-T5' and 'CLIP' and also implies the use of a deep learning framework, but it does not provide specific version numbers for any software dependencies required for reproducibility. |
| Experiment Setup | Yes | We employ a cosine decay learning rate, starting from 2.8e-4, and conduct the training over 12 epochs using 32 GPUs. For the instruction tuning phase, if not otherwise specified, we set P_neg to 0.1, and the cross-attention fusion layers are inserted into the language model's decoder every 3rd layer. The models are trained for 12 epochs. The initial learning rate is set to 2.5e-5 and is decayed by a factor of 0.1 at the 8th and 11th epochs. The max token length is set to 512, following (Chung et al., 2022). (A schedule sketch follows the table.) |
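The dataset-splits row above describes an 80/20 partition of the LLM-generated instructions. The snippet below is a rough illustration of such a split only; the file name, seed, and function name are assumptions, since the paper does not release code or describe how the shuffle was performed.

```python
# Hypothetical sketch of the 80/20 instruction split described in the table;
# the file name and fixed seed are assumptions, not details from the paper.
import json
import random

def split_instructions(instructions, train_ratio=0.8, seed=0):
    """Shuffle LLM-generated instructions and split them into
    in-domain training and held-out evaluation subsets."""
    rng = random.Random(seed)
    shuffled = list(instructions)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

if __name__ == "__main__":
    # e.g. instructions generated from Objects365 class names by an LLM
    with open("iod_bench_instructions.json") as f:   # hypothetical path
        instructions = json.load(f)
    train_insts, eval_insts = split_instructions(instructions)
    print(f"train: {len(train_insts)}, eval: {len(eval_insts)}")
```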
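The experiment-setup row lists a cosine-decay schedule starting at 2.8e-4 for the first stage and a 2.5e-5 step schedule (decayed by 0.1 at epochs 8 and 11) for instruction tuning. The sketch below restates those quoted hyperparameters as code for readability; it is a minimal reconstruction, not the authors' implementation, and every identifier is invented.

```python
# Minimal sketch of the two training schedules quoted in the table.
# The paper releases no code, so all names here are assumptions.
import math

def stage1_cosine_lr(epoch, total_epochs=12, base_lr=2.8e-4):
    """Cosine-decay learning rate for the 12-epoch first stage."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def stage2_step_lr(epoch, base_lr=2.5e-5, milestones=(8, 11), gamma=0.1):
    """Instruction-tuning schedule: decay by 0.1 at the 8th and 11th epochs."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

INSTRUCTION_TUNING_CFG = {
    "p_neg": 0.1,             # probability of sampling a negative instruction
    "cross_attn_every_n": 3,  # fusion layers inserted every 3rd decoder layer
    "epochs": 12,
    "max_token_length": 512,  # following Chung et al., 2022
}

if __name__ == "__main__":
    for epoch in range(12):
        print(epoch, f"{stage1_cosine_lr(epoch):.2e}", f"{stage2_step_lr(epoch):.2e}")
```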