Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Image Difference Captioning with Pre-training and Contrastive Learning
Authors: Linli Yao, Weiying Wang, Qin Jin (pp. 3108-3116)
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed modeling framework. |
| Researcher Affiliation | Academia | School of Information, Renmin University of China |
| Pseudocode | No | The paper describes the model and pre-training tasks in text and diagrams, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The codes and models will be released at https://github.com/yaolinli/IDC. |
| Open Datasets | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019) is automatically built via the CLEVR engine... Birds-to-Words dataset (Tan et al. 2019) describes the fine-grained difference... CUB (Wah et al. 2011) serves as a single image captioning dataset... NABirds (Van Horn et al. 2015) is a fine-grained visual classification dataset... |
| Dataset Splits | Yes | CLEVR-Change dataset (Park, Darrell, and Rohrbach 2019)... 67,660, 3,976 and 7,970 image pairs for training, validation and test split respectively. ... For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. ... CUB (Wah et al. 2011)... We use the split of 8855 training images and 2933 validation images following (Yan et al. 2021). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments. |
| Software Dependencies | No | The paper mentions tools like ResNet101, Stanford Core NLP, and Adam optimizer, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The word embedding is learned from scratch and its dimension is 512. For the cross-modal transformer, the hidden size is 512, the attention head is 8, and the layer number is 2 for Birds-to-Words and 3 for CLEVR-Change. We set τ1, τ2 in contrastive learning to 1. In FDA task, we rewrite 6 negative sentences for each image pair, among which retrieve:replace:confuse=2:2:2. For CLEVR-Change, we sample the batch from the three pre-training tasks with ratio of MLM:MVCL:FDA=8:1:2. We pre-train the model with 8K warm-up steps and 250K iterations in total. For Birds-to-words, the ratio of pre-training tasks is MLM:MVCL:FDA=9:1:2. The warm-up steps are 4K and total training steps are 50K. In the pre-training stage, we apply Adam (Kingma and Ba 2014) optimizer with learning rate 1e-4. In the finetuning stage, the learning rate is set as 3e-5. Early-stop is applied on the main metric to avoid overfitting. The sentence is generated with greedy search in inference. |
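
The task ratios quoted in the Experiment Setup row (MLM:MVCL:FDA = 8:1:2 for CLEVR-Change and 9:1:2 for Birds-to-Words) can be read as per-batch sampling weights. The sketch below illustrates that reading only. The dictionary layout, the key names, and the functions `task_probabilities` and `sample_task` are hypothetical, and weighted random sampling is an assumption: the quoted text does not say whether the authors sample tasks randomly or follow a fixed schedule.

```python
import random
from collections import Counter

# Values are taken from the Experiment Setup row above.
# The dictionary layout and key names are hypothetical, not the authors' config format.
PRETRAIN_CONFIG = {
    "CLEVR-Change": {
        "task_ratio": {"MLM": 8, "MVCL": 1, "FDA": 2},
        "warmup_steps": 8_000,
        "total_steps": 250_000,
        "num_layers": 3,
    },
    "Birds-to-Words": {
        "task_ratio": {"MLM": 9, "MVCL": 1, "FDA": 2},
        "warmup_steps": 4_000,
        "total_steps": 50_000,
        "num_layers": 2,
    },
}


def task_probabilities(task_ratio):
    """Normalize integer ratios into sampling probabilities."""
    total = sum(task_ratio.values())
    return {task: weight / total for task, weight in task_ratio.items()}


def sample_task(task_ratio, rng):
    """Pick the pre-training task for one batch.

    Assumption: tasks are drawn independently per batch with probability
    proportional to the ratio. A deterministic round-robin schedule would
    satisfy the same ratio and may be what the authors actually use.
    """
    tasks = list(task_ratio)
    weights = [task_ratio[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    ratio = PRETRAIN_CONFIG["CLEVR-Change"]["task_ratio"]
    print(task_probabilities(ratio))
    print(Counter(sample_task(ratio, rng) for _ in range(11_000)))
```

Under this reading, MLM accounts for 8/11 of the CLEVR-Change pre-training batches and 9/12 of the Birds-to-Words batches, with the contrastive (MVCL) and alignment (FDA) tasks sharing the remainder.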