VIXEN: Visual Text Comparison Network for Image Difference Captioning

Authors: Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform our main evaluation on a subset of the Instruct Pix2Pix (Brooks, Holynski, and Efros 2022) dataset, unseen by models during training. To ensure a high quality of the synthetically generated image-caption pairs, we score their correspondence via a user study. Additionally, we crowdsource annotations for a subset of images from the PSBattles (Heller, Rossetto, and Schuldt 2018) dataset and fine-tune and evaluate on Image Editing Request (Tan et al. 2019).
Researcher Affiliation | Collaboration | Alexander Black¹, Jing Shi², Yifei Fan², Tu Bui¹, John Collomosse¹,² (¹CVSSP, University of Surrey; ²Adobe Research); {alex.black|t.v.bui}@surrey.ac.uk, {jingshi|yifan|collomos}@adobe.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The method is described textually and with a high-level architecture diagram.
Open Source Code | Yes | Code and data are available at http://github.com/alexblck/vixen
Open Datasets | Yes | We release an augmentation of the recent Instruct Pix2Pix (IP2P) dataset with synthetic change captions generated via GPT-3 as a basis for training and evaluating VIXEN.
Dataset Splits | Yes | This results in 837,466/93,052/5,000 train/validation/test splits.
Hardware Specification | Yes | Total training time is approximately 100 hours on a single A100 GPU.
Software Dependencies | No | The paper mentions software components and models such as GPT-J, CLIP, BLIP-2, and the AdamW optimizer, but it does not specify version numbers for these dependencies or for the underlying programming environment (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | Total training time is approximately 100 hours on a single A100 GPU. We use gradient accumulation to train with an effective batch size of 2048 and optimize the loss using the AdamW optimizer with β1 = 0.9, β2 = 0.98 and weight decay 0.05. For all our models we first train with pd = 0 for two epochs, followed by two more epochs with pd = 0.5.
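
To make the quoted training setup concrete, below is a minimal PyTorch-style sketch of gradient accumulation to an effective batch size of 2048 with AdamW (β1 = 0.9, β2 = 0.98, weight decay 0.05) and the two-phase caption-dropout schedule (pd = 0 for two epochs, then pd = 0.5 for two more). Only those hyperparameters come from the paper's text; the model/dataloader interfaces, the `caption_dropout_p` argument, the micro-batch size, and the learning rate are illustrative assumptions, not details from VIXEN's released code.

```python
import torch

def train(model, dataloader, micro_batch_size=32, effective_batch_size=2048):
    # AdamW hyperparameters as reported: β1 = 0.9, β2 = 0.98, weight decay 0.05.
    # (The learning rate is not given in the quote above; PyTorch's default is used here.)
    optimizer = torch.optim.AdamW(model.parameters(), betas=(0.9, 0.98), weight_decay=0.05)

    # Number of micro-batches accumulated before each optimizer step,
    # chosen so that the effective batch size is 2048.
    accumulation_steps = effective_batch_size // micro_batch_size

    # Two epochs with caption dropout pd = 0, then two more epochs with pd = 0.5.
    for caption_dropout_p, num_epochs in [(0.0, 2), (0.5, 2)]:
        for _ in range(num_epochs):
            optimizer.zero_grad()
            for step, batch in enumerate(dataloader):
                # Hypothetical forward pass returning the training loss;
                # the real VIXEN interface may differ.
                loss = model(batch, caption_dropout_p=caption_dropout_p)
                # Scale so accumulated gradients average over the effective batch.
                (loss / accumulation_steps).backward()
                if (step + 1) % accumulation_steps == 0:
                    optimizer.step()
                    optimizer.zero_grad()
```

Dividing each micro-batch loss by `accumulation_steps` keeps the accumulated gradient equivalent to one computed over the full 2048-sample effective batch, which is the usual motivation for gradient accumulation on a single GPU.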