Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder
Authors: Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. Section headings: 4 Experiments; 4.1 Experimental Setup; 4.1.1 Datasets; 4.1.2 Evaluation Metrics; 4.1.3 Implementation Details; 4.2 Performance Comparison with State-of-the-art; 4.3 Ablation Studies |
| Researcher Affiliation | Academia | Zheyuan Liu EMAIL Australian National University Weixuan Sun EMAIL Australian National University Damien Teney EMAIL Idiap Research Institute Australian Institute for Machine Learning (AIML) Stephen Gould EMAIL Australian National University |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (e.g., Figure 2, Figure 3), but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR. |
| Open Datasets | Yes | Following previous work, we consider two datasets in different domains. Fashion-IQ (Wu et al., 2021) is a dataset of fashion products in three categories, namely Dress, Shirt, and Toptee, which form over 30k triplets with 77k images. The annotations are collected from human annotators and are overall concise. CIRR (Liu et al., 2021) is proposed to specifically study the fine-grained visiolinguistic cues and implicit human agreements. It contains 36k pairs of queries with human-generated annotations, where images often contain rich object interactions (Suhr et al., 2019). Both datasets are publicly released under the MIT License, which allows distribution and academic usage. |
| Dataset Splits | Yes | For Fashion-IQ, we report results on the validation split, as the ground truths of the test split remain nonpublic. For CIRR, we report our main results on the test split obtained from the evaluation server. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 80G with PyTorch while enabling automatic mixed precision (Micikevicius et al., 2018). |
| Software Dependencies | No | All experiments are conducted on a single NVIDIA A100 80G with PyTorch while enabling automatic mixed precision (Micikevicius et al., 2018). We base our implementation on the BLIP codebase. While PyTorch is mentioned, a specific version number is not provided, nor are specific versions for other libraries or the BLIP codebase. |
| Experiment Setup | Yes | Image resolution is set to 384 × 384. We initialize the image and text encoders with the BLIP w/ ViT-B pre-trained weights. In both stages, we freeze the ViT image encoder and only finetune the text encoders due to the GPU memory limits. For all models in both stages, we use AdamW (Loshchilov & Hutter, 2019) with an initial learning rate of 2 × 10⁻⁵, a weight decay of 0.05, and a cosine learning rate scheduler (Loshchilov & Hutter, 2017) with its minimum learning rate set to 0. For the candidate filtering (first-stage) model, we train with a batch size of 512 for 10 epochs on both Fashion-IQ and CIRR. For the candidate re-ranking (second-stage) model, we reduce the batch size to 16 due to the GPU memory limit, as it requires exhaustively pairing up queries with each candidate. For Fashion-IQ, we train the re-ranking model for 50 epochs; for CIRR, we train it for 80 epochs. |
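The experiment-setup excerpt above reports a cosine learning-rate schedule (Loshchilov & Hutter, 2017) with a base learning rate of 2 × 10⁻⁵ decaying to a minimum of 0. A minimal sketch of that per-step schedule, assuming annealing over the full training run with no warmup or restarts (details the paper does not specify):

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Cosine-annealed learning rate: starts at base_lr (2e-5 in the
    paper), decays to min_lr (0 in the paper) by the final step."""
    cos_factor = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_factor)

# Example: the re-ranking stage on Fashion-IQ trains for 50 epochs;
# here one "step" stands in for one epoch for illustration.
epochs = 50
schedule = [cosine_lr(e, epochs) for e in range(epochs + 1)]
```

The schedule begins at the base rate, passes through half the base rate at the midpoint, and reaches the minimum at the end of training; in practice one would typically use `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=0` alongside `torch.optim.AdamW` to get the same behavior.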