Image Difference Captioning with Pre-training and Contrastive Learning

Authors: Linli Yao, Weiying Wang, Qin Jin

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed modeling framework.
Researcher Affiliation | Academia | School of Information, Renmin University of China {linliyao, wy.wang, qjin}@ruc.edu.cn
Pseudocode | No | The paper describes the model and pre-training tasks in text and diagrams, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The codes and models will be released at https://github.com/yaolinli/IDC.
Open Datasets | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is automatically built via the CLEVR engine... Birds-to-Words dataset (Tan et al. 2019) describes the fine-grained difference... CUB (Wah et al. 2011) serves as a single image captioning dataset... NABirds (Van Horn et al. 2015) is a fine-grained visual classification dataset...
Dataset Splits | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019)... 67,660, 3,976 and 7,970 image pairs for training, validation and test split respectively. ... For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. ... CUB (Wah et al. 2011)... We use the split of 8855 training images and 2933 validation images following (Yan et al. 2021).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments.
Software Dependencies | No | The paper mentions tools like ResNet101, Stanford Core NLP, and Adam optimizer, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | The word embedding is learned from scratch and its dimension is 512. For the cross-modal transformer, the hidden size is 512, the attention head is 8, and the layer number is 2 for Birds-to-Words and 3 for CLEVR-Change. We set τ1, τ2 in contrastive learning to 1. In FDA task, we rewrite 6 negative sentences for each image pair, among which retrieve:replace:confuse=2:2:2. For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. We pre-train the model with 8K warm-up steps and 250K iterations in total. For Birds-to-words, the ratio of pre-training tasks is MLM:MVCL:FDA=9:1:2. The warm-up steps are 4K and total training steps are 50K. In the pre-training stage, we apply Adam (Kingma and Ba 2014) optimizer with learning rate 1e-4. In the finetuning stage, the learning rate is set as 3e-5. Early-stop is applied on the main metric to avoid overfitting. The sentence is generated with greedy search in inference.
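
To make the quoted setup easier to scan, the minimal Python sketch below collects the reported hyperparameters and split sizes into one configuration. All names here (DATASET_SPLITS, CONFIG, sample_pretrain_task) are hypothetical and not taken from the paper or its repository; the values simply restate the quotes above, and the released code at https://github.com/yaolinli/IDC remains the authoritative reference.

    import random

    # Hedged sketch (not from the paper or its repo): hypothetical names;
    # values restate the quoted experiment setup and dataset splits above.

    DATASET_SPLITS = {
        # image pairs per split, as reported for CLEVR-Change
        "CLEVR-Change": {"train": 67_660, "val": 3_976, "test": 7_970},
        # CUB split used in pre-training, following Yan et al. 2021
        "CUB": {"train": 8_855, "val": 2_933},
    }

    CONFIG = {
        "CLEVR-Change": {
            "word_embedding_dim": 512,
            "transformer": {"hidden_size": 512, "attention_heads": 8, "layers": 3},
            "contrastive_temperature": {"tau1": 1.0, "tau2": 1.0},
            "fda_negatives_per_pair": {"retrieve": 2, "replace": 2, "confuse": 2},
            "pretrain_task_ratio": {"MLM": 8, "MVCL": 1, "FDA": 2},
            "pretrain": {"warmup_steps": 8_000, "total_steps": 250_000, "lr": 1e-4},
            "finetune_lr": 3e-5,
            "decoding": "greedy",
        },
        "Birds-to-Words": {
            "word_embedding_dim": 512,
            "transformer": {"hidden_size": 512, "attention_heads": 8, "layers": 2},
            "contrastive_temperature": {"tau1": 1.0, "tau2": 1.0},
            "fda_negatives_per_pair": {"retrieve": 2, "replace": 2, "confuse": 2},
            "pretrain_task_ratio": {"MLM": 9, "MVCL": 1, "FDA": 2},
            "pretrain": {"warmup_steps": 4_000, "total_steps": 50_000, "lr": 1e-4},
            "finetune_lr": 3e-5,
            "decoding": "greedy",
        },
    }

    def sample_pretrain_task(dataset: str = "CLEVR-Change") -> str:
        """Choose which pre-training task the next batch comes from,
        in proportion to the reported MLM:MVCL:FDA ratio."""
        ratio = CONFIG[dataset]["pretrain_task_ratio"]
        tasks, weights = zip(*ratio.items())
        return random.choices(tasks, weights=weights, k=1)[0]

The sampling helper only illustrates the stated batch ratio; how the released code actually schedules MLM, MVCL, and FDA batches may differ.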