Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ExcluIR: Exclusionary Neural Information Retrieval
Authors: Wenhao Zhang, Mengqi Zhang, Shiguang Wu, Jiahuan Pei, Zhaochun Ren, Maarten de Rijke, Zhumin Chen, Pengjie Ren
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct detailed experiments and analyses, obtaining three main observations: (i) existing retrieval models with different architectures struggle to comprehend exclusionary queries effectively; (ii) although integrating our training data can improve the performance of retrieval models on exclusionary retrieval, there still exists a gap compared to human performance; and (iii) generative retrieval models have a natural advantage in handling exclusionary queries. |
| Researcher Affiliation | Academia | 1 Shandong University, Qingdao, China 2 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands 3 Leiden University, Leiden, The Netherlands 4 University of Amsterdam, Amsterdam, The Netherlands |
| Pseudocode | No | The paper describes methods and processes in natural language, but does not include any explicit sections or figures labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, code-like blocks. |
| Open Source Code | Yes | We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR. |
| Open Datasets | Yes | We present ExcluIR, a set of resources for exclusionary retrieval, consisting of an evaluation benchmark and a training set for helping retrieval models to comprehend exclusionary queries. The dataset is built based on HotpotQA (Yang et al. 2018). We share the benchmark and evaluation scripts on https://github.com/zwh-sdu/ExcluIR. NQ is a large-scale dataset for document retrieval and question answering. The version we use is NQ320k, which consists of 320k query-document pairs. |
| Dataset Splits | Yes | We split the original HotpotQA in the same way as our ExcluIR dataset, resulting in a 70k training set and a 3.5k test set. Following the dataset construction process described above, we obtained 3,452 human-annotated entries for the benchmark and 70,293 exclusionary queries for the training set. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks such as ChatGPT (GPT-3.5), T5, BERT, and BART, but does not provide specific version numbers for software libraries or development environments (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication. |
| Experiment Setup | No | The paper discusses various models and datasets, and presents evaluation metrics and results, but it does not specify concrete experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or specific optimizer settings in the main text. |