Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Instance-Level Composed Image Retrieval

Authors: Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate BASIC on our proposed i-CIR as well as four composed image retrieval benchmarks: ImageNet-R, MiniDN, NICO++, and LTLL. Retrieval performance is measured using the standard mean Average Precision (mAP) metric. Table 1: Average mAP (%) comparison across datasets. 5.3 Ablation studies
Researcher Affiliation	Academia	1VRG, FEE, Czech Technical University in Prague 2Robotics Institute, Athena Research Center 3National Technical University of Athens 4Hellenic Robotics Center of Excellence 5IARAI
Pseudocode	No	The paper describes methods and processes like BASIC in text and with a high-level overview in Figure 4, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	No	Project page: https://vrg.fel.cvut.cz/icir/. Both i-CIR dataset and code will be made publicly available through our project page https://vrg.fel.cvut.cz/icir/.
Open Datasets	Yes	We introduce i-CIR, a new evaluation dataset for CIR, meant to retrieve images containing the same particular object as the visual query under modifications defined by the text query. All i-CIR images sourced from the LAION [46] dataset. Both i-CIR dataset and code will be made publicly available through our project page https://vrg.fel.cvut.cz/icir/.
Dataset Splits	Yes	For i-CIR, we report the macro-mAP over instances, defined by first computing mAP per instance and then taking the mean of these per-instance mAPs across all instances. During the development of BASIC, we set aside a small portion of the i-CIR crawl as a development set; none of these images appears in the final test benchmark. i-CIR dev consists of 15 object instances, 92 composed queries, and 45K images in total.
Hardware Specification	Yes	We acknowledge VSB Technical University of Ostrava, IT4Innovations National Supercomputing Center, Czech Republic, for awarding this project (OPEN-33-67) access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium, through the Ministry of Education, Youth and Sports of the Czech Republic via the e-INFRA CZ project (ID: 90254).
Software Dependencies	No	All methods use CLIP with ViT-L/14 [7], whereas CompoDiff employs the larger CLIP ViT-G/14. The corpora C+ and C were automatically generated using ChatGPT [17]. Specifically, we first used ChatGPT to generate 100 simple and diverse textual prompts... Each prompt was then used to generate 4 corresponding images using Stable Diffusion [42]... can be efficiently handled by existing libraries, e.g. FAISS [22].
Experiment Setup	Yes	We set k = 250 components for PCA, λ = 0.1 for the Harris criterion and α = 0.2. These values were fixed once on a small privately owned development set, named i-CIR dev. The corpora C+ and C were automatically generated using ChatGPT [17]. The statistics sv min and st min were computed over a synthetically generated dataset constructed using Stable Diffusion [42] with automatically created prompts.
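
The macro-mAP quoted in the Dataset Splits row (mAP computed per instance, then averaged across instances) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the input layout (a dict mapping each instance to a list of queries, each query a rank-ordered list of relevance flags), and the assumption that every positive appears somewhere in the ranked list are all hypothetical choices made for illustration.

```python
from statistics import mean


def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance: list of booleans in rank order (best match first).
    Assumes the full database is ranked, so every positive is in the list.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return mean(precisions) if precisions else 0.0


def macro_map(queries_by_instance):
    """Mean over instances of the per-instance mAP.

    queries_by_instance: hypothetical layout, {instance_id: [query, ...]},
    where each instance has at least one query.
    """
    per_instance = [
        mean(average_precision(query) for query in queries)
        for queries in queries_by_instance.values()
    ]
    return mean(per_instance)


# Toy data: instance "a" has two queries, instance "b" has one.
toy = {
    "a": [[True, False], [False, True]],
    "b": [[True, True]],
}
score = macro_map(toy)
```

On this toy input the per-instance mAPs are 0.75 and 1.0, so the macro-mAP is 0.875. Averaging the three queries directly would give a different number, which is why the order of averaging matters when instances have unequal numbers of queries.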