Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ExcluIR: Exclusionary Neural Information Retrieval

Authors: Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct detailed experiments and analyses, obtaining three main observations: (i) existing retrieval models with different architectures struggle to comprehend exclusionary queries effectively; (ii) although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; and (iii) generative retrieval models have a natural advantage in handling exclusionary queries.
Researcher Affiliation | Academia | 1 Shandong University, Qingdao, China; 2 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands; 3 Leiden University, Leiden, The Netherlands; 4 University of Amsterdam, Amsterdam, The Netherlands
Pseudocode | No | The paper describes its methods and processes in natural language, but includes no sections or figures labeled 'Pseudocode' or 'Algorithm', nor any structured, code-like blocks.
Open Source Code | Yes | We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR.
Open Datasets | Yes | We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The dataset is built based on HotpotQA (Yang et al. 2018). We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR. NQ is a large-scale dataset for document retrieval and question answering. The version we use is NQ320k, which consists of 320k query-document pairs.
Dataset Splits | Yes | We split the original HotpotQA in the same way as our ExcluIR dataset, resulting in a 70k training set and a 3.5k test set. Following the dataset construction process described above, we obtained 3,452 human-annotated entries for the benchmark and 70,293 exclusionary queries for the training set.
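A fixed, reproducible train/test split like the 70k/3.5k partition quoted above is typically produced with a seeded shuffle. The sketch below is illustrative only: the helper name, seed, and toy data are hypothetical, and the actual ExcluIR split follows the original HotpotQA partition rather than this procedure.

```python
import random

def split_queries(queries, test_size, seed=42):
    # Hypothetical helper: deterministically shuffle with a fixed seed,
    # then hold out the first `test_size` items as the test set.
    rng = random.Random(seed)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Toy usage: 100 queries, 5 held out for evaluation.
train, test = split_queries([f"q{i}" for i in range(100)], test_size=5)
```

Because the shuffle is seeded, rerunning the split yields identical train and test sets, which is the property a reproducibility report checks for.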
Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions various models and frameworks like ChatGPT (GPT-3.5), T5, BERT, and BART, but does not provide specific version numbers for software libraries or development environments (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication.
Experiment Setup | No | The paper discusses various models and datasets, and presents evaluation metrics and results, but it does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or specific optimizer settings in the main text.