Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. In this section, we conduct empirical experiments and theoretical analysis to substantiate that PLM-based retrievers assign higher relevance scores to documents with lower perplexity. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2 CAS Key Laboratory of AI Safety, Institute of Computing Technology, Beijing, China; 3 Huawei Noah's Ark Lab, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1: The Proposed CDC: Debiasing with Causal Diagnosis and Correction |
| Open Source Code | Yes | Codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap. |
| Open Datasets | Yes | Datasets. We select three widely-used IR datasets from different domains to ensure the broad applicability of our findings: (1) DL19 dataset (Craswell et al., 2020) for exploring retrieval across miscellaneous domains. (2) TREC-COVID dataset (Voorhees et al., 2021) focused on biomedical information retrieval. (3) SCIDOCS (Cohan et al., 2020) dedicated to the retrieval of scientific scholarly articles. |
| Dataset Splits | Yes | At the domain level, we employ bias diagnosis on the training set of DL19 to estimate the biased effect β2 for each retrieval model, and then conduct in-domain and cross-domain evaluation on the test sets of DL19, TREC-COVID, and SCIDOCS. Note that only 128 samples (i.e., estimation budget M = 128) are used for bias diagnosis; this sample size is sufficient for effective results. |
| Hardware Specification | Yes | Our experiments are all conducted on machines equipped with NVIDIA A6000 GPUs and 52-core Intel(R) Xeon(R) Gold 6230R CPUs at 2.10GHz. |
| Software Dependencies | No | The paper mentions models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ANCE (Xiong et al., 2020), TAS-B (Hofstätter et al., 2021), Contriever (Izacard et al., 2022), coCondenser (Gao and Callan, 2022) and LLMs such as Llama2-7B-chat (Touvron et al., 2023), GPT-4 (Achiam et al., 2023), GPT-3.5, and Mistral (Jiang et al., 2023). However, it does not specify version numbers for the underlying software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Specifically, we manipulate the sampling temperatures during generation to obtain LLM-generated documents with different PPLs but similar semantic content. Following the method of Dai et al. (2024c), we use the following simple prompt: "Please rewrite the following text: {human-written text}". Note that only 128 samples (i.e., estimation budget M = 128) are used for bias diagnosis; this sample size is sufficient for effective results. |
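The experiment setup above relies on two mechanisms: sampling temperature rescales the model's token distribution (higher temperature flattens it, so generated text tends toward higher perplexity), and perplexity itself is the exponentiated average negative log-likelihood per token. The minimal sketch below illustrates both in plain Python; it is not the authors' code, and the function names `softmax_with_temperature` and `perplexity` are illustrative, not taken from the paper's repository.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Rescale logits by 1/temperature before the softmax.

    Higher temperature flattens the distribution, so sampling produces
    less predictable text, which a scoring LM assigns higher perplexity.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Higher temperature spreads probability mass away from the top token.
print(softmax_with_temperature([2.0, 1.0, 0.0], 0.5))
print(softmax_with_temperature([2.0, 1.0, 0.0], 2.0))

# A document whose tokens each have probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

In the paper's setup, the rewrites generated at each temperature would be scored by a language model to obtain per-token log-probabilities, from which document-level PPL is computed as above.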