Image Difference Captioning with Pre-training and Contrastive Learning
Authors: Linli Yao, Weiying Wang, Qin Jin
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed modeling framework. |
| Researcher Affiliation | Academia | School of Information, Renmin University of China {linliyao, wy.wang, qjin}@ruc.edu.cn |
| Pseudocode | No | The paper describes the model and pre-training tasks in text and diagrams, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The codes and models will be released at https://github.com/yaolinli/IDC. |
| Open Datasets | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is automatically built via the CLEVR engine... Birds-to-Words dataset (Tan et al. 2019) describes the fine-grained difference... CUB (Wah et al. 2011) serves as a single image captioning dataset... NABirds (Van Horn et al. 2015) is a fine-grained visual classification dataset... |
| Dataset Splits | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019)... 67,660, 3,976 and 7,970 image pairs for training, validation and test split respectively. ... For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. ... CUB (Wah et al. 2011)... We use the split of 8855 training images and 2933 validation images following (Yan et al. 2021). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments. |
| Software Dependencies | No | The paper mentions tools like ResNet101, Stanford Core NLP, and Adam optimizer, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The word embedding is learned from scratch and its dimension is 512. For the cross-modal transformer, the hidden size is 512, the attention head is 8, and the layer number is 2 for Birds-to-Words and 3 for CLEVR-Change. We set τ1, τ2 in contrastive learning to 1. In FDA task, we rewrite 6 negative sentences for each image pair, among which retrieve:replace:confuse=2:2:2. For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. We pre-train the model with 8K warm-up steps and 250K iterations in total. For Birds-to-Words, the ratio of pre-training tasks is MLM:MVCL:FDA=9:1:2. The warm-up steps are 4K and total training steps are 50K. In the pre-training stage, we apply Adam (Kingma and Ba 2014) optimizer with learning rate 1e-4. In the finetuning stage, the learning rate is set as 3e-5. Early-stop is applied on the main metric to avoid overfitting. The sentence is generated with greedy search in inference. (See the configuration sketch after this table.) |
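
For reference, the following is a minimal Python sketch (standard library only) that collects the hyperparameters reported in the Experiment Setup row. The names `PretrainConfig`, `sample_pretraining_task`, and `warmup_lr` are illustrative assumptions, not taken from the authors' released code at https://github.com/yaolinli/IDC, and the shape of the warm-up schedule beyond the reported warm-up steps is likewise assumed.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PretrainConfig:
    """Hyperparameters as reported in the paper (CLEVR-Change defaults)."""
    word_embed_dim: int = 512           # word embeddings learned from scratch
    hidden_size: int = 512              # cross-modal transformer hidden size
    num_attention_heads: int = 8
    num_layers: int = 3                 # 3 for CLEVR-Change, 2 for Birds-to-Words
    tau1: float = 1.0                   # contrastive-learning temperatures
    tau2: float = 1.0
    # Batch sampling ratio over pre-training tasks (CLEVR-Change: MLM:MVCL:FDA = 8:1:2;
    # Birds-to-Words uses 9:1:2).
    task_ratio: dict = field(default_factory=lambda: {"MLM": 8, "MVCL": 1, "FDA": 2})
    warmup_steps: int = 8_000           # 4K for Birds-to-Words
    total_steps: int = 250_000          # 50K for Birds-to-Words
    pretrain_lr: float = 1e-4           # Adam, pre-training stage
    finetune_lr: float = 3e-5           # Adam, fine-tuning stage


def sample_pretraining_task(cfg: PretrainConfig) -> str:
    """Draw the pre-training task for the next batch according to the reported ratio."""
    tasks, weights = zip(*cfg.task_ratio.items())
    return random.choices(tasks, weights=weights, k=1)[0]


def warmup_lr(step: int, cfg: PretrainConfig) -> float:
    """Linear warm-up to the peak pre-training learning rate (assumed schedule;
    the paper only reports the number of warm-up steps)."""
    if step < cfg.warmup_steps:
        return cfg.pretrain_lr * step / cfg.warmup_steps
    return cfg.pretrain_lr


if __name__ == "__main__":
    cfg = PretrainConfig()
    print(sample_pretraining_task(cfg), warmup_lr(1_000, cfg))
```

The sketch only records the reported values and the per-batch task sampling; it does not implement the MLM, MVCL, or FDA objectives themselves, which are described in the paper's method section.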