Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Towards Robust Uncertainty Calibration for Composed Image Retrieval

Authors: Yifan Wang, Wuliang Huang, Yufan Wen, Shunning Liu, Chun Yuan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments and ablation analysis on benchmark datasets Fashion IQ and CIRR verify the robustness of RUNC in predicting reliable retrieval results from a large image gallery.
Researcher Affiliation	Academia	Yifan Wang1, Wuliang Huang2, Yufan Wen1, Shunning Liu1, Chun Yuan1 1Tsinghua Shenzhen International Graduate School, Tsinghua University 2Institute of Computing Technology, Chinese Academy of Sciences
Pseudocode	No	The paper describes the methodology using mathematical equations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Code is available at: RUNC-source. While this statement mentions code availability, "RUNC-source" is not a direct, specific URL or a clear reference to supplementary material where the code can be accessed. It appears to be an incomplete reference.
Open Datasets	Yes	Datasets. The proposed RUNC is employed on two widely-used composed image retrieval datasets covering various modification requirements in real-life retrieval scenarios. Fashion IQ [54], concentrating on fashion item retrieval, addresses the retrieval for modifications in attributes including colors, patterns, textures, and design details across dress, toptee, and shirt categories. The whole dataset contains 77,684 fashion pictures and each matched triplet is constituted of a reference image, a modification sentence, and one target image. Following [1, 18], the dataset is split by the proportion of 3:1:1 for training, validating, and testing respectively. CIRR [48] involves more natural scenes, and its query texts focus more on alterations in the relationships among subjects, backgrounds, and multiple subjects within intricate images. It consists of 21,552 images collected from the NLVR2 dataset [55] and constructs 36,554 matched pairs.
Dataset Splits	Yes	Fashion IQ [54], concentrating on fashion item retrieval, addresses the retrieval for modifications in attributes including colors, patterns, textures, and design details across dress, toptee, and shirt categories. The whole dataset contains 77,684 fashion pictures and each matched triplet is constituted of a reference image, a modification sentence, and one target image. Following [1, 18], the dataset is split by the proportion of 3:1:1 for training, validating, and testing respectively.
Hardware Specification	Yes	The experiments were implemented in PyTorch on a single NVIDIA A800 GPU and trained for 30 epochs for Fashion IQ and 50 epochs for CIRR.
Software Dependencies	No	The experiments were implemented in PyTorch on a single NVIDIA A800 GPU and trained for 30 epochs for Fashion IQ and 50 epochs for CIRR. We exploited the visual and textual encoders as BLIP-2 ViT-G/14 model and initialized parameters from pre-trained EVA-CLIP [32] weights. While PyTorch is mentioned, specific version numbers for PyTorch or other critical libraries are not provided.
Experiment Setup	Yes	The visual encoder remained frozen and the remaining layers were fine-tuned in the training phase. The virtual guidance was disabled during the inference phase. The uncertainty perceptron was implemented as one feed-forward network (two linear layers) with a softplus activation function. The dropout rate was set as 0.2. The dimensions of fusion and candidate features were fixed as 256 in the embedding space and the number of learnable queries was set as 32. We set λ1 as 0.01 to in Eq. 10. We used AdamW optimizer and set the learning rate as 2 × 10⁻⁵ in Fashion IQ and 1 × 10⁻⁵ in CIRR with cosine annealing decay. The training and inference time of the proposed model are 214.3s and 27.4s respectively. The experiments were implemented in PyTorch on a single NVIDIA A800 GPU and trained for 30 epochs for Fashion IQ and 50 epochs for CIRR.
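
The learning-rate schedule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes the standard cosine annealing formula, one update per epoch, no warmup, and a minimum learning rate of 0, none of which the quoted text specifies. The names SETUPS, cosine_annealed_lr, and schedule are hypothetical.

```python
import math

# Base learning rates and epoch counts as reported in the Experiment Setup row.
# The dictionary layout is hypothetical and only serves this illustration.
SETUPS = {
    "Fashion IQ": {"base_lr": 2e-5, "epochs": 30},
    "CIRR": {"base_lr": 1e-5, "epochs": 50},
}


def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Standard cosine annealing from base_lr (epoch 0) to min_lr (final epoch).

    Assumptions not stated in the source: per-epoch updates, no warmup,
    and min_lr = 0.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine


def schedule(dataset):
    """Return the learning rate used at the start of each training epoch."""
    cfg = SETUPS[dataset]
    return [
        cosine_annealed_lr(cfg["base_lr"], epoch, cfg["epochs"])
        for epoch in range(cfg["epochs"])
    ]


if __name__ == "__main__":
    for name in SETUPS:
        rates = schedule(name)
        print(f"{name}: first={rates[0]:.2e}, last={rates[-1]:.2e}")
```

Under these assumptions the rate falls to half the base value at the midpoint of training, for example 1 × 10⁻⁵ at epoch 15 of 30 for Fashion IQ. A run that used warmup or a non-zero floor would follow a different curve.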