Sentence-level Prompts Benefit Composed Image Retrieval

Authors: Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo, Rick Siow Mong Goh, Chun-Mei Feng

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
Researcher Affiliation | Collaboration | Yang Bai (1), Xinxing Xu (1), Yong Liu (1), Salman Khan (2, 3), Fahad Khan (2), Wangmeng Zuo (4), Rick Siow Mong Goh (1), Chun-Mei Feng (1). (1) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; (2) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE; (3) Australian National University, Canberra ACT, Australia; (4) Harbin Institute of Technology, Harbin, China
Pseudocode | No | The paper does not contain pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | fengcm.ai@gmail.com https://github.com/chunmeifeng/SPRC
Open Datasets | Yes | We evaluate our method on two CIR benchmarks: (1) Fashion-IQ a fashion dataset with 77,684 images forming 30,134 triplets (Wu et al., 2021). ... (2) CIRR is a general image dataset that comprises 36,554 triplets derived from 21,552 images from the popular natural language inference dataset NLVR2 (Suhr et al., 2018).
Dataset Splits | Yes | We randomly split this dataset into training, validation, and test sets in an 8:1:1 ratio.
Hardware Specification | Yes | Our method is implemented with Pytorch on one NVIDIA RTX A100 GPU with 40GB memory.
Software Dependencies | No | The paper mentions 'Pytorch' but does not specify a version number or other software dependencies with versions.
Experiment Setup | Yes | We resize the input image size to 224 × 224 and with a padding ratio of 1.25 for uniformity (Baldrati et al., 2022b). The learning rate is initialized to 1e-5 and 2e-5 following a cosine schedule for the CIRR and Fashion-IQ datasets, respectively. The hyperparameters of prompt length and γ are set to 32 and 0.8, respectively.
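
The Experiment Setup row gives concrete values but not how they are applied, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes that the padding ratio of 1.25 caps the long-to-short aspect ratio before resizing, and that the cosine schedule decays from the initial learning rate to zero without warmup. The names `target_pad`, `cosine_lr`, and `total_steps` are hypothetical and do not come from the paper or its repository.

```python
import math

# Values quoted in the Experiment Setup row.
IMAGE_SIZE = 224
PAD_RATIO = 1.25
BASE_LR = {"CIRR": 1e-5, "Fashion-IQ": 2e-5}
PROMPT_LENGTH = 32
GAMMA = 0.8


def target_pad(width, height, ratio=PAD_RATIO):
    """Return (horizontal, vertical) padding per side, in pixels.

    Assumption: the padding ratio caps the long/short aspect ratio, and
    images already within the cap are left unpadded.
    """
    long_side, short_side = max(width, height), min(width, height)
    if long_side / short_side < ratio:
        return 0, 0
    # Grow the short side until long_side / short_side equals the ratio.
    target_short = long_side / ratio
    pad_w = max(int((target_short - width) / 2), 0)
    pad_h = max(int((target_short - height) / 2), 0)
    return pad_w, pad_h


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr to 0 (assumed: no warmup, floor of 0)."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, a 500 × 200 image would receive 100 pixels of padding above and below, giving 500 × 400 before the resize to 224 × 224, and the learning rate would fall to half its initial value at the midpoint of training. The total number of training steps is not stated in the row above, so `total_steps` has to be supplied by the caller.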