Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GenIR: Generative Visual Feedback for Mental Image Retrieval

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario.
Researcher Affiliation	Collaboration	1 University of California Santa Cruz, 2 Northeastern University, 3 Accenture
Pseudocode	Yes	Algorithm 1 Data Annotation Pipeline
Open Source Code	Yes	Code and data are available at https://github.com/mikelmh025/generative_ir.
Open Datasets	Yes	We evaluate our method across four datasets with distinct visual domains to demonstrate the robustness of our approach. (1) MS COCO [16]'s 50k validation set... (2) FFHQ [8]... (3) Flickr30k [21]... (4) Clothing-ADC [17]...
Dataset Splits	Yes	All experiments were conducted using the full 50,000-image validation set as the search space, representing a challenging large-scale retrieval scenario.
Hardware Specification	Yes	All experiments were conducted using 4 NVIDIA A6000 GPUs with 48GB of VRAM each.
Software Dependencies	No	The paper mentions diffusion model names (Infinity, Lumina-Image-2.0, Stable Diffusion 3.5, FLUX.1, HiDream-I1) and BLIP-2, as well as Gemma3, but does not provide specific version numbers for ancillary software dependencies like programming languages or libraries (e.g., Python version, PyTorch version).
Experiment Setup	Yes	We provide the hyperparameters used for each of our experimental settings to ensure reproducibility: Table 2: Hyperparameters for diffusion model inference (Model Inference Steps, Guidance Scale, Image Resolution). For all image-to-image retrieval experiments, we used BLIP-2 with the following configuration: Feature dimension: 256; Similarity metric: Cosine similarity; Normalization: L2. For Gemma3 (both 4B and 12B variants), we used the following parameters: Temperature: 0.7; Max tokens: 500; Repetition penalty: 1.1; Sampling method: Greedy with temperature.
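
The retrieval configuration quoted in the Experiment Setup row (256-dimensional features, L2 normalization, cosine similarity) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and random vectors stand in for the BLIP-2 image features, which this sketch does not compute.

```python
import math
import random

FEATURE_DIM = 256  # feature dimension reported in the Experiment Setup row


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def cosine_rank(query, gallery):
    """Return gallery indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    scores = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        # For unit vectors, the dot product equals the cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, g)), idx))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores]


# Hypothetical stand-in features; real ones would come from an image encoder.
rng = random.Random(0)
gallery = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(100)]

# A query that is a rescaled, slightly noisy copy of gallery item 42.
query = [3.0 * x + rng.gauss(0.0, 0.05) for x in gallery[42]]

ranking = cosine_rank(query, gallery)
print("Top-5 gallery indices:", ranking[:5])
```

Because both vectors are normalized first, the rescaling of the query has no effect on the ranking; only the direction of the feature vector matters.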