Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ExcluIR: Exclusionary Neural Information Retrieval

Authors: Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct detailed experiments and analyses, obtaining three main observations: (i) existing retrieval models with different architectures struggle to comprehend exclusionary queries effectively; (ii) although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; and (iii) generative retrieval models have a natural advantage in handling exclusionary queries.
Researcher Affiliation | Academia | 1 Shandong University, Qingdao, China; 2 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands; 3 Leiden University, Leiden, The Netherlands; 4 University of Amsterdam, Amsterdam, The Netherlands
Pseudocode | No | The paper describes its methods and processes in natural language, but includes no sections or figures labeled 'Pseudocode' or 'Algorithm', nor any structured, code-like blocks.
Open Source Code | Yes | We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR.
Open Datasets | Yes | We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The dataset is built based on HotpotQA (Yang et al. 2018). We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR. NQ is a large-scale dataset for document retrieval and question answering. The version we use is NQ320k, which consists of 320k query-document pairs.
Dataset Splits | Yes | We split the original HotpotQA in the same way as our ExcluIR dataset, resulting in a 70k training set and a 3.5k test set. Following the dataset construction process described above, we obtained 3,452 human-annotated entries for the benchmark and 70,293 exclusionary queries for the training set.
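A fixed, reproducible train/test split like the 70k/3.5k partition quoted above is typically produced with a seeded shuffle. The sketch below is illustrative only: the helper name, seed, and toy data are hypothetical, and the actual ExcluIR split follows the original HotpotQA partition rather than this procedure.

```python
import random

def split_queries(queries, test_size, seed=42):
    # Hypothetical helper: deterministically shuffle with a fixed seed,
    # then hold out the first `test_size` items as the test set.
    rng = random.Random(seed)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Toy usage: 100 queries, 5 held out for evaluation.
train, test = split_queries([f"q{i}" for i in range(100)], test_size=5)
```

Because the shuffle is seeded, rerunning the split yields identical train and test sets, which is the property a reproducibility report checks for.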
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks like ChatGPT (GPT-3.5), T5, BERT, and BART, but does not provide specific version numbers for software libraries or development environments (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication.
Experiment Setup | No | The paper discusses various models and datasets, and presents evaluation metrics and results, but it does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or specific optimizer settings in the main text.