Data Roaming and Quality Assessment for Composed Image Retrieval
Authors: Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that this new baseline outperforms the current state-of-the-art methods on established benchmarks like Fashion IQ and CIRR. |
| Researcher Affiliation | Collaboration | Matan Levy¹, Rami Ben-Ari², Nir Darshan², Dani Lischinski¹ — ¹The Hebrew University of Jerusalem, Israel; ²OriginAI, Israel |
| Pseudocode | No | The paper describes the architecture and methods but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | To construct LaSCo, we leverage the carefully labeled datasets that exist for the well-studied VQA task (Yang et al. 2021). Specifically, we utilize the VQA2.0 (Goyal et al. 2017) dataset to create LaSCo with minimal human effort. Fashion IQ (Wu et al. 2021) contains crowdsourced descriptions of differences between images of fashion products. CIRR contains open-domain natural images, taken from NLVR2 (Suhr et al. 2019). |
| Dataset Splits | Yes | We also present results with pre-training on a mixture of COCO captions, which are very descriptive, to better handle samples where the transition text is highly detailed, making the query image often redundant (i.e., text-to-image retrieval). To this end, we conduct an experiment where we train CASE on LaSCo, replacing 50% of transition texts Qt with captions corresponding to the target image. Namely, we change the training distribution to combine both CoIR and text-to-image samples, as discussed in Sec. 3.1. We then explain the results through the properties of different datasets in terms of modality redundancy (see the mixing sketch after the table). |
| Hardware Specification | Yes | Training on four A100 nodes takes 0.5-6 minutes per epoch, depending on dataset size. |
| Software Dependencies | No | The paper mentions software components like BERT, BLIP, and the AdamW optimizer, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use an AdamW optimizer, initializing the learning rate at 5×10⁻⁵ with an exponential decay rate of 0.93 down to 1×10⁻⁶. We train CASE on CIRR with a batch size of 2048 for 6 epochs. For Fashion IQ, we train with a batch size of 1024 for 20 epochs (further ablation on the batch size is available in the suppl. material). On LaSCo we train with a batch size of 3840 for 10 epochs (see the optimizer sketch below the table). |
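
The 50% caption-replacement experiment quoted under Dataset Splits can be illustrated with a short sketch. This is a minimal reconstruction, not the authors' code: the function name and the sample keys (`transition_text`, `target_caption`) are hypothetical, since the paper does not specify how LaSCo samples are stored.

```python
import random

def mix_lasco_with_captions(samples, caption_ratio=0.5, seed=0):
    """Replace a fraction of transition texts with target-image captions.

    Hypothetical keys: 'transition_text' and 'target_caption'; the paper
    does not specify the data format.
    """
    rng = random.Random(seed)
    mixed = []
    for sample in samples:
        sample = dict(sample)  # shallow copy so the input list is untouched
        if rng.random() < caption_ratio:
            # The query now pairs the (often redundant) query image with a
            # full caption of the target, i.e. a text-to-image sample.
            sample["transition_text"] = sample["target_caption"]
        mixed.append(sample)
    return mixed

# Usage: mixed = mix_lasco_with_captions(lasco_train_samples, caption_ratio=0.5)
```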
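The quoted Experiment Setup also maps onto a standard optimizer/scheduler configuration. The sketch below assumes PyTorch (the paper names no framework or versions): `torch.optim.AdamW` with the reported initial learning rate, a per-epoch exponential decay of 0.93, and a clamp at the reported floor of 1×10⁻⁶. The `Linear` model is a placeholder standing in for CASE.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(512, 512)  # placeholder; stands in for the CASE model

optimizer = AdamW(model.parameters(), lr=5e-5)    # reported initial LR
scheduler = ExponentialLR(optimizer, gamma=0.93)  # reported decay rate
MIN_LR = 1e-6                                     # reported LR floor

for epoch in range(6):  # 6 epochs for CIRR; 20 for Fashion IQ, 10 for LaSCo
    # ... one training pass over the data (batch size 2048 for CIRR) ...
    scheduler.step()
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"], MIN_LR)  # clamp at the reported floor
```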