Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

QuARI: Query Adaptive Retrieval Improvement

Authors: Eric P Xing, Abby Stylianou, Robert Pless, Nathan Jacobs

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Results show that this method consistently outperforms state-of-the-art alternatives, including those that require many orders of magnitude more computation at query time. Code and pre-trained models are available at https://github.com/mvrl/QuARI. We demonstrate that QuARI yields large improvements in retrieval accuracy on multiple extremely challenging retrieval tasks. 4 Evaluation 5.3 Ablation Studies
Researcher Affiliation	Academia	Eric Xing (1), Abby Stylianou (2), Robert Pless (3), Nathan Jacobs (1). (1) Washington University in St. Louis, (2) Saint Louis University, (3) George Washington University
Pseudocode	No	The paper describes the methodology using mathematical equations and descriptive text, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code	No	All code and datasets will be released upon acceptance of this paper, but is not included with this submission.
Open Datasets	Yes	Evaluation Datasets. We focus our evaluation on two challenging benchmarks: ILIAS and INQUIRE. ILIAS (Instance-Level Image retrieval At Scale) is a large-scale dataset designed to assess instance-level image retrieval capabilities [16]. INQUIRE [43] is a text-to-image retrieval benchmark tailored for expert-level ecological queries. Training Datasets. We utilize Microsoft Common Objects in Context (MS COCO) [19], Conceptual Captions 12M, and synthetically augmented BioTrove [46] to train QuARI.
Dataset Splits	No	We utilize Microsoft Common Objects in Context (MS COCO) [19], Conceptual Captions 12M, and synthetically augmented BioTrove [46] to train QuARI. We extract a random subset of 5M samples from BioTrove for training. While evaluation benchmarks like ILIAS and INQUIRE have defined tasks, the paper does not specify how the training datasets (MS COCO, Conceptual Captions 12M, BioTrove subset) are split into training, validation, and test sets for the model's internal development.
Hardware Specification	Yes	Experiments were conducted on an NVIDIA H100 with 80GB of VRAM.
Software Dependencies	No	The paper mentions the use of the AdamW optimizer and various backbone models (CLIP, SigLIP, etc.), but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup	Yes	We use the AdamW optimizer [23] with a cosine annealed learning rate cycling between 1e-5 and 2e-7 and a weight decay of 1e-2. We train with a batch size of 320 and a contrastive temperature of 0.07. QuARI's transformer backbone is randomly initialized with 4-8 transformer layers depending on the size of the backbone encoder. The query encoder and both the query and column decoders are two-layer MLPs with GELU activation functions and layer normalization [1].
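
The learning-rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation. The two learning-rate bounds come from the quoted text; the cycle length (CYCLE_STEPS) and the function name are hypothetical, since the quoted text does not state them, and "cycling" is read here as cosine annealing with restarts.

```python
import math

# Bounds quoted in the Experiment Setup row.
LR_MAX = 1e-5
LR_MIN = 2e-7
# Hypothetical: the cycle length is not given in the quoted text.
CYCLE_STEPS = 1000


def cosine_annealed_lr(step, lr_max=LR_MAX, lr_min=LR_MIN, cycle_steps=CYCLE_STEPS):
    """Return the learning rate at `step`, restarting the cosine decay every `cycle_steps`."""
    # Position inside the current cycle, in [0, cycle_steps).
    t = step % cycle_steps
    # Cosine factor decays from 1 (cycle start) towards 0 (cycle end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return lr_min + (lr_max - lr_min) * cosine


if __name__ == "__main__":
    # Print the rate at the start, middle, and end of one cycle, and after the restart.
    for step in (0, 500, 999, 1000):
        print(step, cosine_annealed_lr(step))
```

Under these assumptions the rate starts each cycle at 1e-5, decays towards 2e-7, and jumps back to 1e-5 at the restart. The remaining quoted settings (weight decay 1e-2, batch size 320, contrastive temperature 0.07) would be passed to the optimizer, data loader, and loss respectively.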