Data Roaming and Quality Assessment for Composed Image Retrieval

Authors: Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like FashionIQ and CIRR.
Researcher Affiliation | Collaboration | Matan Levy¹, Rami Ben-Ari², Nir Darshan², Dani Lischinski¹; ¹The Hebrew University of Jerusalem, Israel; ²Origin AI, Israel
Pseudocode | No | The paper describes the architecture and methods but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | To construct LaSCo, we leverage the carefully labeled datasets that exist for the well-studied VQA task (Yang et al. 2021). Specifically, we utilize the VQA2.0 (Goyal et al. 2017) dataset to create LaSCo with minimal human effort. FashionIQ (Wu et al. 2021) contains crowdsourced descriptions of differences between images of fashion products. CIRR contains open-domain natural images, taken from NLVR2 (Suhr et al. 2019).
Dataset Splits | Yes | We also present results with pre-training on a mixture of COCO captions, which are very descriptive, to better handle samples where the transition text is highly detailed, making the query image often redundant (i.e., text-to-image retrieval). To this end, we conduct an experiment where we train CASE on LaSCo, replacing 50% of the transition texts Qt with captions corresponding to the target image. Namely, we change the training distribution to combine both CoIR and text-to-image samples, as discussed in Sec. 3.1. We then explain the results through the properties of different datasets in terms of modality redundancy. (A sketch of this 50/50 mixing procedure is given below the table.)
Hardware Specification | Yes | Training on four A100 nodes takes 0.5-6 minutes per epoch, depending on dataset size.
Software Dependencies | No | The paper mentions software components like BERT, BLIP, and the AdamW optimizer, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We use an AdamW optimizer, initializing the learning rate at 5×10⁻⁵ with an exponential decay rate of 0.93 down to 1×10⁻⁶. We train CASE on CIRR with a batch size of 2048 for 6 epochs. For FashionIQ, we train with a batch size of 1024 for 20 epochs (further ablation on the batch size is available in the suppl. material). On LaSCo, we train with a batch size of 3840 for 10 epochs. (A configuration sketch of this schedule is given below the table.)
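
The 50% caption-replacement experiment quoted under Dataset Splits can be illustrated with a minimal sketch. This is not the authors' code (none is released): the record field names (transition_text, target_caption, text) and the per-sample random draw are assumptions; the excerpt only states that 50% of transition texts are replaced with captions of the target image, not whether the split is fixed or re-drawn per epoch.

```python
import random

def mix_coir_and_text_to_image(samples, caption_ratio=0.5, seed=0):
    """Build a training mixture in which a random half of the CoIR samples
    keep their transition text and the other half use a caption of the
    target image instead (i.e. behave like text-to-image samples)."""
    rng = random.Random(seed)
    mixed = []
    for sample in samples:
        record = dict(sample)  # copy so the original LaSCo record is untouched
        if rng.random() < caption_ratio:
            # Text-to-image style: the text alone describes the target image.
            record["text"] = record["target_caption"]
        else:
            # Regular CoIR style: keep the transition text Qt.
            record["text"] = record["transition_text"]
        mixed.append(record)
    return mixed
```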
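
Likewise, the optimizer settings quoted under Experiment Setup can be expressed as a hedged PyTorch sketch. The stand-in model, the per-epoch stepping, and the way the 1×10⁻⁶ floor is enforced (a clamped LambdaLR) are assumptions; only the AdamW optimizer, the 5×10⁻⁵ starting rate, the 0.93 exponential decay, and the CIRR batch-size/epoch counts come from the quoted text.

```python
import torch

# Stand-in model; the paper's CASE architecture (BLIP-based) is not released.
model = torch.nn.Linear(768, 768)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Exponential decay of 0.93 per epoch, clamped so the rate never drops below 1e-6.
# (The paper states the decay rate and the end value, not how the floor is applied.)
floor = 1e-6 / 5e-5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: max(0.93 ** epoch, floor)
)

num_epochs, batch_size = 6, 2048  # CIRR settings quoted above
for epoch in range(num_epochs):
    # ... one training epoch over batches of `batch_size` would run here ...
    scheduler.step()
```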