Image Difference Captioning with Pre-training and Contrastive Learning

Authors: Linli Yao, Weiying Wang, Qin Jin

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed modeling framework.
Researcher Affiliation | Academia | School of Information, Renmin University of China {linliyao, wy.wang, qjin}@ruc.edu.cn
Pseudocode | No | The paper describes the model and pre-training tasks in text and diagrams, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The codes and models will be released at https://github.com/yaolinli/IDC.
Open Datasets | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is automatically built via the CLEVR engine... Birds-to-Words dataset (Tan et al. 2019) describes the fine-grained difference... CUB (Wah et al. 2011) serves as a single image captioning dataset... NABirds (Van Horn et al. 2015) is a fine-grained visual classification dataset...
Dataset Splits | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019)... 67,660, 3,976 and 7,970 image pairs for training, validation and test split respectively. ... For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. ... CUB (Wah et al. 2011)... We use the split of 8855 training images and 2933 validation images following (Yan et al. 2021).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments.
Software Dependencies | No | The paper mentions tools like ResNet101, Stanford Core NLP, and Adam optimizer, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | The word embedding is learned from scratch and its dimension is 512. For the cross-modal transformer, the hidden size is 512, the attention head is 8, and the layer number is 2 for Birds-to-Words and 3 for CLEVR-Change. We set τ1, τ2 in contrastive learning to 1. In FDA task, we rewrite 6 negative sentences for each image pair, among which retrieve:replace:confuse=2:2:2. For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. We pre-train the model with 8K warm-up steps and 250K iterations in total. For Birds-to-words, the ratio of pre-training tasks is MLM:MVCL:FDA=9:1:2. The warm-up steps are 4K and total training steps are 50K. In the pre-training stage, we apply Adam (Kingma and Ba 2014) optimizer with learning rate 1e-4. In the finetuning stage, the learning rate is set as 3e-5. Early-stop is applied on the main metric to avoid overfitting. The sentence is generated with greedy search in inference.
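
To make the quoted setup easier to scan, the minimal Python sketch below collects the reported hyperparameters and split sizes into one configuration. All names here (DATASET_SPLITS, CONFIG, sample_pretrain_task) are hypothetical and not taken from the paper or its repository; the values simply restate the quotes above, and the released code at https://github.com/yaolinli/IDC remains the authoritative reference.

    import random

    # Hedged sketch (not from the paper or its repo): hypothetical names;
    # values restate the quoted experiment setup and dataset splits above.

    DATASET_SPLITS = {
        # image pairs per split, as reported for CLEVR-Change
        "CLEVR-Change": {"train": 67_660, "val": 3_976, "test": 7_970},
        # CUB split used in pre-training, following Yan et al. 2021
        "CUB": {"train": 8_855, "val": 2_933},
    }

    CONFIG = {
        "CLEVR-Change": {
            "word_embedding_dim": 512,
            "transformer": {"hidden_size": 512, "attention_heads": 8, "layers": 3},
            "contrastive_temperature": {"tau1": 1.0, "tau2": 1.0},
            "fda_negatives_per_pair": {"retrieve": 2, "replace": 2, "confuse": 2},
            "pretrain_task_ratio": {"MLM": 8, "MVCL": 1, "FDA": 2},
            "pretrain": {"warmup_steps": 8_000, "total_steps": 250_000, "lr": 1e-4},
            "finetune_lr": 3e-5,
            "decoding": "greedy",
        },
        "Birds-to-Words": {
            "word_embedding_dim": 512,
            "transformer": {"hidden_size": 512, "attention_heads": 8, "layers": 2},
            "contrastive_temperature": {"tau1": 1.0, "tau2": 1.0},
            "fda_negatives_per_pair": {"retrieve": 2, "replace": 2, "confuse": 2},
            "pretrain_task_ratio": {"MLM": 9, "MVCL": 1, "FDA": 2},
            "pretrain": {"warmup_steps": 4_000, "total_steps": 50_000, "lr": 1e-4},
            "finetune_lr": 3e-5,
            "decoding": "greedy",
        },
    }

    def sample_pretrain_task(dataset: str = "CLEVR-Change") -> str:
        """Choose which pre-training task the next batch comes from,
        in proportion to the reported MLM:MVCL:FDA ratio."""
        ratio = CONFIG[dataset]["pretrain_task_ratio"]
        tasks, weights = zip(*ratio.items())
        return random.choices(tasks, weights=weights, k=1)[0]

The sampling helper only illustrates the stated batch ratio; how the released code actually schedules MLM, MVCL, and FDA batches may differ.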