Decomposing Semantic Shifts for Composed Image Retrieval

Authors: Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and Fashion IQ datasets, respectively, and establishes a new state-of-the-art performance.
Researcher Affiliation | Collaboration | Xingyu Yang1,2*, Daqing Liu3, Heng Zhang4, Yong Luo1,2, Chaoyue Wang3, Jing Zhang5. 1School of Computer Science, National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; 2Hubei Luojia Laboratory, Wuhan, China; 3JD Explore Academy, JD.com, China; 4Gaoling School of Artificial Intelligence, Renmin University of China, China; 5School of Computer Science, The University of Sydney, Australia
Pseudocode | No | The paper describes the model architecture and processes in detail, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/starxing-yuu/SSN.
Open Datasets | Yes | CIRR Dataset (Liu et al. 2021) is a released open-domain dataset for the CIR task...Fashion IQ Dataset (Wu et al. 2021) is a realistic dataset for interactive image retrieval in the fashion domain.
Dataset Splits | Yes | Of the 36,554 triplets, 80% are used for training, 10% for validation, and 10% for evaluation.
Hardware Specification | Yes | All experiments can be implemented with PyTorch on a single NVIDIA RTX 3090 Ti GPU.
Software Dependencies | No | The paper mentions PyTorch but does not provide version numbers for it or for any other software dependencies.
Experiment Setup | Yes | The hidden dimension of the 1-layer 8-head transformer encoder is set to 512. The temperature λ of the main retrieval loss (in Eq. (7)) is equal to 100. Note that for Fashion IQ, we fix the image encoder after one training epoch and fine-tune the text encoder only. We adopt the AdamW optimizer with an initial learning rate of 5e-5 to train the whole model. We apply the step scheduler to decay the learning rate by a factor of 10 every 10 epochs. The batch size is set to 128 and the network is trained for 50 epochs. (A hedged configuration sketch follows the table.)
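
To make the reported experiment setup concrete, the following minimal PyTorch sketch wires the stated hyperparameters together. It is an illustration, not the authors' released code: the names FusionEncoder and retrieval_loss are hypothetical, and the temperature-scaled contrastive loss is a generic stand-in for the paper's Eq. (7), which is not reproduced here. Only the numeric settings (512-dimensional, 1-layer, 8-head transformer encoder; λ = 100; AdamW at 5e-5; 10x decay every 10 epochs; batch size 128; 50 epochs) come from the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoder(nn.Module):
    """Hypothetical stand-in for the paper's fusion module: a 1-layer,
    8-head transformer encoder with hidden dimension 512."""
    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):
        # tokens: (batch, seq_len, 512) fused image/text token sequence
        return self.encoder(tokens)

def retrieval_loss(queries, targets, temperature=100.0):
    """Generic batch-wise contrastive retrieval loss with temperature scaling.
    Only the temperature value (lambda = 100) comes from the paper; the exact
    form of Eq. (7) may differ."""
    queries = F.normalize(queries, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = temperature * queries @ targets.t()   # (batch, batch) cosine similarities
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)

# Reported optimization settings: AdamW, lr 5e-5, learning rate divided by 10
# every 10 epochs, batch size 128, 50 epochs.
model = FusionEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

BATCH_SIZE, NUM_EPOCHS = 128, 50
for epoch in range(NUM_EPOCHS):
    # ... iterate over training triplets, compute retrieval_loss, and step the
    # optimizer here (omitted) ...
    scheduler.step()
```

For Fashion IQ, per the reported setup, the sketch would additionally freeze the image encoder (e.g. via requires_grad_(False)) after the first epoch and continue fine-tuning only the text encoder.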