Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction
Authors: Renjie Pi, Lewei Yao, Jianhua Han, Xiaodan Liang, Wei Zhang, Hang Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on IOD-Bench demonstrate that our model consistently outperforms baseline methods that directly combine LLMs with detection models. |
| Researcher Affiliation | Collaboration | Renjie Pi¹, Lewei Yao¹, Jianhua Han², Xiaodan Liang³, Wei Zhang², Hang Xu²; ¹Hong Kong University of Science and Technology, ²Huawei Noah's Ark Lab, ³Sun Yat-sen University and MBZUAI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses leveraging 'open-source large-scale vision-language models' but does not provide any explicit statement or link for the source code of their proposed method, Ins-DetCLIP. |
| Open Datasets | Yes | We create an instruction-guided detection dataset termed IOD-Bench and develop corresponding evaluation metrics. Our approach involves leveraging the class names provided by the Objects365 dataset (Shao et al., 2019) and utilizing a powerful LLM (OpenAI, 2022) to generate instructions related to these classes. |
| Dataset Splits | Yes | We allocated 80% of these instructions for training, while the remainder is reserved for evaluation. The training dataset is constructed using the in-domain instructions and training images from Objects365, while the validation set is constructed using 20,000 images from the validation split of Objects365, paired with both in-domain and out-of-domain instructions. (A split sketch follows the table.) |
| Hardware Specification | Yes | We employ a cosine decay learning rate, starting from 2.8e-4, and conduct the training over 12 epochs using 32 GPUs. In the second stage, it takes only around 24 hours on 16 V100 GPUs for instruction tuning. |
| Software Dependencies | No | The paper mentions software components like 'Flan-T5' and 'CLIP' and also implies the use of a deep learning framework, but it does not provide specific version numbers for any software dependencies required for reproducibility. |
| Experiment Setup | Yes | We employ a cosine decay learning rate, starting from 2.8e-4, and conduct the training over 12 epochs using 32 GPUs. For the instruction tuning phase, if not otherwise specified, we set P_neg to 0.1, and the cross-attention fusion layers are inserted into the language model's decoder every 3rd layer. The models are trained for 12 epochs. The initial learning rate is set to 2.5e-5 and is decayed by a factor of 0.1 at the 8th and 11th epochs. The max token length is set to 512, following (Chung et al., 2022). (A schedule sketch follows the table.) |
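The dataset-splits row above describes an 80/20 partition of the LLM-generated instructions. The snippet below is a rough illustration of such a split only; the file name, seed, and function name are assumptions, since the paper does not release code or describe how the shuffle was performed.

```python
# Hypothetical sketch of the 80/20 instruction split described in the table;
# the file name and fixed seed are assumptions, not details from the paper.
import json
import random

def split_instructions(instructions, train_ratio=0.8, seed=0):
    """Shuffle LLM-generated instructions and split them into
    in-domain training and held-out evaluation subsets."""
    rng = random.Random(seed)
    shuffled = list(instructions)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

if __name__ == "__main__":
    # e.g. instructions generated from Objects365 class names by an LLM
    with open("iod_bench_instructions.json") as f:   # hypothetical path
        instructions = json.load(f)
    train_insts, eval_insts = split_instructions(instructions)
    print(f"train: {len(train_insts)}, eval: {len(eval_insts)}")
```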
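The experiment-setup row lists a cosine-decay schedule starting at 2.8e-4 for the first stage and a 2.5e-5 step schedule (decayed by 0.1 at epochs 8 and 11) for instruction tuning. The sketch below restates those quoted hyperparameters as code for readability; it is a minimal reconstruction, not the authors' implementation, and every identifier is invented.

```python
# Minimal sketch of the two training schedules quoted in the table.
# The paper releases no code, so all names here are assumptions.
import math

def stage1_cosine_lr(epoch, total_epochs=12, base_lr=2.8e-4):
    """Cosine-decay learning rate for the 12-epoch first stage."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def stage2_step_lr(epoch, base_lr=2.5e-5, milestones=(8, 11), gamma=0.1):
    """Instruction-tuning schedule: decay by 0.1 at the 8th and 11th epochs."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

INSTRUCTION_TUNING_CFG = {
    "p_neg": 0.1,             # probability of sampling a negative instruction
    "cross_attn_every_n": 3,  # fusion layers inserted every 3rd decoder layer
    "epochs": 12,
    "max_token_length": 512,  # following Chung et al., 2022
}

if __name__ == "__main__":
    for epoch in range(12):
        print(epoch, f"{stage1_cosine_lr(epoch):.2e}", f"{stage2_step_lr(epoch):.2e}")
```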