VIXEN: Visual Text Comparison Network for Image Difference Captioning
Authors: Alexander Black, Jing Shi, Yifei Fan, Tu Bui, John Collomosse
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform our main evaluation on a subset of the Instruct Pix2Pix (Brooks, Holynski, and Efros 2022) dataset, unseen by models during training. To ensure a high quality of the synthetically generated image-caption pairs, we score their correspondence via a user study. Additionally, we crowdsource annotations for a subset of images from the PSBattles (Heller, Rossetto, and Schuldt 2018) dataset and fine-tune and evaluate on Image Editing Request (Tan et al. 2019). |
| Researcher Affiliation | Collaboration | Alexander Black1, Jing Shi2, Yifei Fan2, Tu Bui1, John Collomosse1,2 1CVSSP, University of Surrey 2Adobe Research {alex.black|t.v.bui}@surrey.ac.uk, {jingshi|yifan|collomos}@adobe.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The method is described textually and with a high-level architecture diagram. |
| Open Source Code | Yes | Code and data are available at http://github.com/alexblck/vixen |
| Open Datasets | Yes | We release an augmentation of the recent Instruct Pix2Pix (IP2P) dataset with synthetic change captions generated via GPT-3 as a basis for training and evaluating VIXEN. |
| Dataset Splits | Yes | This results in an 837,466/93,052/5,000 train/validation/test split. |
| Hardware Specification | Yes | Total training time is approximately 100 hours on a single A100 GPU. |
| Software Dependencies | No | The paper mentions software components and models like GPT-J, CLIP, BLIP-2, and AdamW optimizer, but it does not specify explicit version numbers for these software dependencies or the underlying programming environment (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | Total training time is approximately 100 hours on a single A100 GPU. We use gradient accumulation to train with an effective batch size of 2048 and optimize the loss using the AdamW optimizer with β1 = 0.9, β2 = 0.98 and weight decay 0.05. For all our models we first train with pd = 0 for two epochs, followed by two more epochs with pd = 0.5. |
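The optimizer settings in the Experiment Setup row can be sketched in PyTorch. This is a minimal illustration of AdamW with β1 = 0.9, β2 = 0.98, weight decay 0.05, and gradient accumulation up to an effective batch size of 2048; the model, loss, and per-step micro-batch size are hypothetical placeholders, not details from the paper.

```python
import torch

micro_batch_size = 32                   # assumed per-step batch size (not from the paper)
accum_steps = 2048 // micro_batch_size  # accumulate gradients to an effective batch of 2048

model = torch.nn.Linear(512, 512)       # placeholder for the trainable module
optimizer = torch.optim.AdamW(
    model.parameters(),
    betas=(0.9, 0.98),                  # β1, β2 as reported in the table
    weight_decay=0.05,
)

def training_step(micro_batches):
    """One optimizer update accumulated over len(micro_batches) micro-batches."""
    optimizer.zero_grad()
    for x, y in micro_batches:
        loss = torch.nn.functional.mse_loss(model(x), y)  # placeholder loss
        (loss / len(micro_batches)).backward()  # scale so accumulated grads average
    optimizer.step()
```

With `micro_batch_size = 32`, each update accumulates over 64 micro-batches; the scaling by `len(micro_batches)` keeps the accumulated gradient equal to the mean over the full effective batch.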