Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. In this section, we conduct empirical experiments and theoretical analysis to substantiate that PLM-based retrievers assign higher relevance scores to documents with lower perplexity. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 CAS Key Laboratory of AI Safety, Institute of Computing Technology, Beijing, China; 3 Huawei Noah's Ark Lab, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1: The Proposed CDC: Debiasing with Causal Diagnosis and Correction |
| Open Source Code | Yes | Codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap. |
| Open Datasets | Yes | Datasets. We select three widely-used IR datasets from different domains to ensure the broad applicability of our findings: (1) DL19 dataset (Craswell et al., 2020) for exploring retrieval across miscellaneous domains. (2) TREC-COVID dataset (Voorhees et al., 2021) focused on biomedical information retrieval. (3) SCIDOCS (Cohan et al., 2020) dedicated to the retrieval of scientific scholarly articles. |
| Dataset Splits | Yes | At the domain level, we employ bias diagnosis on the training set of DL19 to estimate the biased effect β2 for each retrieval model, and then conduct in-domain and cross-domain evaluation on the test sets of DL19, TREC-COVID, and SCIDOCS. Note that only 128 samples (i.e., estimation budget M = 128) are used for bias diagnosis; this sample size is sufficient for effective results. |
| Hardware Specification | Yes | Our experiments are all conducted on machines equipped with NVIDIA A6000 GPUs and 52-core Intel(R) Xeon(R) Gold 6230R CPUs at 2.10GHz. |
| Software Dependencies | No | The paper mentions models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ANCE (Xiong et al., 2020), TAS-B (Hofstätter et al., 2021), Contriever (Izacard et al., 2022), coCondenser (Gao and Callan, 2022) and LLMs such as Llama2-7B-chat (Touvron et al., 2023), GPT-4 (Achiam et al., 2023), GPT-3.5, and Mistral (Jiang et al., 2023). However, it does not specify version numbers for the underlying software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Specifically, we manipulate the sampling temperatures during generation to obtain LLM-generated documents with different PPLs but similar semantic content. Following the method of Dai et al. (2024c), we use the following simple prompt: "Please rewrite the following text: {human-written text}". Note that only 128 samples (i.e., estimation budget M = 128) are used for bias diagnosis; this sample size is sufficient for effective results. |
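The experiment setup above relies on two mechanisms: sampling temperature rescales the model's token distribution (higher temperature flattens it, so generated text tends toward higher perplexity), and perplexity itself is the exponentiated average negative log-likelihood per token. The minimal sketch below illustrates both in plain Python; it is not the authors' code, and the function names `softmax_with_temperature` and `perplexity` are illustrative, not taken from the paper's repository.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Rescale logits by 1/temperature before the softmax.

    Higher temperature flattens the distribution, so sampling produces
    less predictable text, which a scoring LM assigns higher perplexity.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Higher temperature spreads probability mass away from the top token.
print(softmax_with_temperature([2.0, 1.0, 0.0], 0.5))
print(softmax_with_temperature([2.0, 1.0, 0.0], 2.0))

# A document whose tokens each have probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

In the paper's setup, the rewrites generated at each temperature would be scored by a language model to obtain per-token log-probabilities, from which document-level PPL is computed as above.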