Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GenIR: Generative Visual Feedback for Mental Image Retrieval

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario.
Researcher Affiliation | Collaboration | 1 University of California Santa Cruz, 2 Northeastern University, 3 Accenture
Pseudocode | Yes | Algorithm 1: Data Annotation Pipeline
Open Source Code | Yes | Code and data are available at https://github.com/mikelmh025/generative_ir.
Open Datasets | Yes | We evaluate our method across four datasets with distinct visual domains to demonstrate the robustness of our approach. (1) MS COCO [16]'s 50k validation set... (2) FFHQ [8]... (3) Flickr30k [21]... (4) Clothing-ADC [17]...
Dataset Splits | Yes | All experiments were conducted using the full 50,000-image validation set as the search space, representing a challenging large-scale retrieval scenario.
Hardware Specification | Yes | All experiments were conducted using 4 NVIDIA A6000 GPUs with 48GB of VRAM each.
Software Dependencies | No | The paper mentions diffusion model names (Infinity, Lumina-Image-2.0, Stable Diffusion 3.5, FLUX.1, HiDream-I1) and BLIP-2, as well as Gemma3, but does not provide specific version numbers for ancillary software dependencies like programming languages or libraries (e.g., Python version, PyTorch version).
Experiment Setup | Yes | We provide the hyperparameters used for each of our experimental settings to ensure reproducibility: Table 2: Hyperparameters for diffusion model inference (Model, Inference Steps, Guidance Scale, Image Resolution). For all image-to-image retrieval experiments, we used BLIP-2 with the following configuration: Feature dimension: 256; Similarity metric: Cosine similarity; Normalization: L2. For Gemma3 (both 4B and 12B variants), we used the following parameters: Temperature: 0.7; Max tokens: 500; Repetition penalty: 1.1; Sampling method: Greedy with temperature.
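
The retrieval configuration quoted in the Experiment Setup row (L2 normalization followed by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' code: the function names `l2_normalize` and `rank_by_cosine` are hypothetical, the feature vectors are assumed to have been extracted already (for example 256-dimensional BLIP-2 features, as the row states), and the toy vectors in the usage note are 2-dimensional only for readability.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def rank_by_cosine(query, gallery, top_k=5):
    """Return (index, score) pairs for the top_k gallery vectors by cosine similarity.

    Hypothetical helper: after L2 normalization, the dot product of two
    vectors equals their cosine similarity.
    """
    q = l2_normalize(query)
    scored = []
    for idx, feat in enumerate(gallery):
        g = l2_normalize(feat)
        score = sum(a * b for a, b in zip(q, g))
        scored.append((score, idx))
    # Highest similarity first; ties broken by lower gallery index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(idx, score) for score, idx in scored[:top_k]]
```

As a usage example, `rank_by_cosine([1.0, 0.0], [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]], top_k=2)` ranks gallery index 1 first (same direction as the query) and index 2 second. In a setting like the one described in the table, the gallery would instead hold one feature vector per image in the 50,000-image search space.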