Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

VIXEN: Visual Text Comparison Network for Image Difference Captioning

Authors: Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse

AAAI 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform our main evaluation on a subset of the Instruct Pix2Pix (Brooks, Holynski, and Efros 2022) dataset, unseen by models during training. To ensure a high quality of the synthetically generated image-caption pairs, we score their correspondence via a user study. Additionally, we crowdsource annotations for a subset of images from the PSBattles (Heller, Rossetto, and Schuldt 2018) dataset and fine-tune and evaluate on Image Editing Request (Tan et al. 2019). |
| Researcher Affiliation | Collaboration | Alexander Black1, Jing Shi2, Yifei Fan2, Tu Bui1, John Collomosse1,2 1CVSSP, University of Surrey 2Adobe Research {alex.black\|t.v.bui}@surrey.ac.uk, {jingshi\|yifan\|collomos}@adobe.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The method is described textually and with a high-level architecture diagram. |
| Open Source Code | Yes | Code and data are available at http://github.com/alexblck/vixen |
| Open Datasets | Yes | We release an augmentation of the recent Instruct Pix2Pix (IP2P) dataset with synthetic change captions generated via GPT-3 as a basis for training and evaluating VIXEN. |
| Dataset Splits | Yes | This results in a 837,466/93,052/5,000 train/validation/test splits. |
| Hardware Specification | Yes | Total training time is approximately 100 hours on a single A100 GPU. |
| Software Dependencies | No | The paper mentions software components and models like GPT-J, CLIP, BLIP-2, and AdamW optimizer, but it does not specify explicit version numbers for these software dependencies or the underlying programming environment (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | Total training time is approximately 100 hours on a single A100 GPU. We use gradient accumulation to train with an effective batch size of 2048 and optimize the loss using AdamW optimizer with β1 = 0.9, β2 = 0.98 and weight decay 0.05. For all our models we first train with pd = 0 for two epochs, followed by two more epochs with pd = 0.5. |
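
The quoted experiment setup can be restated as a small configuration sketch. This is an illustration only, not the authors' code. The effective batch size, the optimizer hyperparameters, the two-phase pd schedule, and the training split size come from the rows above. The micro-batch size of 32, every class and function name, and the choice to drop a final partial batch are assumptions made here. The meaning of pd is not given in this summary, so it is treated as an opaque value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    effective_batch_size: int = 2048      # quoted in the Experiment Setup row
    micro_batch_size: int = 32            # ASSUMPTION: not stated in the source
    beta1: float = 0.9                    # optimizer beta1 (quoted)
    beta2: float = 0.98                   # optimizer beta2 (quoted)
    weight_decay: float = 0.05            # quoted
    epochs_per_phase: int = 2             # two epochs per phase (quoted)
    phase_pd: tuple = (0.0, 0.5)          # pd = 0, then pd = 0.5 (quoted)


def accumulation_steps(cfg: TrainConfig) -> int:
    """Number of micro-batches accumulated before one optimizer update."""
    if cfg.effective_batch_size % cfg.micro_batch_size != 0:
        raise ValueError("effective batch size must be a multiple of the micro-batch size")
    return cfg.effective_batch_size // cfg.micro_batch_size


def pd_for_epoch(cfg: TrainConfig, epoch: int) -> float:
    """Return pd for a 0-indexed epoch under the two-phase schedule."""
    phase = epoch // cfg.epochs_per_phase
    if epoch < 0 or phase >= len(cfg.phase_pd):
        raise ValueError("epoch is outside the configured schedule")
    return cfg.phase_pd[phase]


def optimizer_steps_per_epoch(cfg: TrainConfig, train_size: int) -> int:
    """Optimizer updates per epoch, dropping a final partial batch (ASSUMPTION)."""
    return train_size // cfg.effective_batch_size


cfg = TrainConfig()
schedule = [pd_for_epoch(cfg, epoch) for epoch in range(4)]
steps = optimizer_steps_per_epoch(cfg, train_size=837_466)  # train split size from the table
```

Under the assumed micro-batch size, each optimizer update would accumulate gradients over 64 micro-batches. A different micro-batch size changes only that count, not the effective batch size reported in the paper.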