Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Authors: Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. Section headings: 4 Experiments; 4.1 Experimental Setup; 4.1.1 Datasets; 4.1.2 Evaluation Metrics; 4.1.3 Implementation Details; 4.2 Performance Comparison with State-of-the-art; 4.3 Ablation Studies |
| Researcher Affiliation | Academia | Zheyuan Liu EMAIL Australian National University Weixuan Sun EMAIL Australian National University Damien Teney EMAIL Idiap Research Institute Australian Institute for Machine Learning (AIML) Stephen Gould EMAIL Australian National University |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (e.g., Figure 2, Figure 3), but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. |
| Open Datasets | Yes | Following previous work, we consider two datasets in different domains. Fashion-IQ (Wu et al., 2021) is a dataset of fashion products in three categories, namely Dress, Shirt, and Toptee, which form over 30k triplets with 77k images. The annotations are collected from human annotators and are overall concise. CIRR (Liu et al., 2021) is proposed to specifically study the fine-grained visiolinguistic cues and implicit human agreements. It contains 36k pairs of queries with human-generated annotations, where images often contain rich object interactions (Suhr et al., 2019). Both datasets are publicly released under the MIT License, which allows distribution and academic usage. |
| Dataset Splits | Yes | For Fashion-IQ, we report results on the validation split, as the ground truths of the test split remain nonpublic. For CIRR, we report our main results on the test split obtained from the evaluation server. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 80G with PyTorch while enabling automatic mixed precision (Micikevicius et al., 2018). |
| Software Dependencies | No | All experiments are conducted on a single NVIDIA A100 80G with PyTorch while enabling automatic mixed precision (Micikevicius et al., 2018). We base our implementation on the BLIP codebase. While PyTorch is mentioned, a specific version number is not provided, nor are specific versions for other libraries or the BLIP codebase. |
| Experiment Setup | Yes | Image resolution is set to 384 × 384. We initialize the image and text encoders with the BLIP w/ ViT-B pre-trained weights. In both stages, we freeze the ViT image encoder and only finetune the text encoders due to the GPU memory limits. For all models in both stages, we use AdamW (Loshchilov & Hutter, 2019) with an initial learning rate of 2 × 10⁻⁵, a weight decay of 0.05, and a cosine learning rate scheduler (Loshchilov & Hutter, 2017) with its minimum learning rate set to 0. For the candidate filtering (first-stage) model, we train with a batch size of 512 for 10 epochs on both Fashion-IQ and CIRR. For the candidate re-ranking (second-stage) model, we reduce the batch size to 16 due to the GPU memory limit, as it requires exhaustively pairing up queries with each candidate. For Fashion-IQ, we train the re-ranking model for 50 epochs; for CIRR, we train it for 80 epochs. |
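The experiment-setup excerpt above reports a cosine learning-rate schedule (Loshchilov & Hutter, 2017) with a base learning rate of 2 × 10⁻⁵ decaying to a minimum of 0. A minimal sketch of that per-step schedule, assuming annealing over the full training run with no warmup or restarts (details the paper does not specify):

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Cosine-annealed learning rate: starts at base_lr (2e-5 in the
    paper), decays to min_lr (0 in the paper) by the final step."""
    cos_factor = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_factor)

# Example: the re-ranking stage on Fashion-IQ trains for 50 epochs;
# here one "step" stands in for one epoch for illustration.
epochs = 50
schedule = [cosine_lr(e, epochs) for e in range(epochs + 1)]
```

The schedule begins at the base rate, passes through half the base rate at the midpoint, and reaches the minimum at the end of training; in practice one would typically use `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=0` alongside `torch.optim.AdamW` to get the same behavior.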