Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

EAReranker: Efficient Embedding Adequacy Assessment for Retrieval Augmented Generation

Authors: Dongyang Zeng, Yaping Liu, Wei Zhang, Shuo Zhang, Xinwang Liu, Binxing Fang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our comprehensive evaluation across four public benchmarks demonstrates that EAReranker achieves competitive performance with state-of-the-art plaintext rerankers while maintaining constant memory usage (550MB) regardless of input length and processing 2-3x faster than traditional approaches. The semantic bin adequacy prediction accuracy of 92.85% LACC@10 and 86.12% LACC@25 demonstrates its capability to effectively filter out inadequate documents that could potentially mislead or adversely impact RAG system performance, thereby ensuring only high-utility information serves as generation context. These results establish EAReranker as an efficient and practical solution for enhancing RAG system performance through improved context selection while addressing the computational and privacy challenges of existing methods.
Researcher Affiliation	Academia	1Guangzhou University, Guangzhou, China 2National University of Defense Technology, Changsha, China
Pseudocode	Yes	Algorithm 1: Multi-Model Semantic Bin Scoring.
Open Source Code	Yes	The source code of EAReranker is available in https://github.com/zjzdy/EAReranker.
Open Datasets	Yes	Our evaluation utilized a dataset comprising 1 million query-document pairs, partitioned into training (80%) and testing (20%) sets. Vector representations were generated using established embedding models: bge-m3 (1024-dimensional) [15], jina-embeddings-v3 (jina-v3, 1024-dimensional) [16], gte-multilingual-base (gte-base, 768-dimensional) [18], and KaLM-embedding-multilingual-mini-instruct-v1.5 (KaLM, 896-dimensional) [20]. We deliberately excluded higher-dimensional models like NV-Embed-v2 (4096-dimensional) [25] to focus on commonly used dimensions in RAG. We curated a comprehensive adequacy assessment dataset by augmenting the bge-m3-data [15] comprising diverse query-document pairs with additional samples sourced from multiple heterogeneous collections [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. This aggregated dataset spans a wide spectrum of domains and query intents, encompassing fact verification, specialized knowledge retrieval, etc. Evaluation across four public benchmarks (FEVER [30], NFCorpus [28], DuRetrieval [42], and T2Ranking [43]) reveals consistent performance patterns, as shown in Table 3.
Dataset Splits	Yes	Our evaluation utilized a dataset comprising 1 million query-document pairs, partitioned into training (80%) and testing (20%) sets.
Hardware Specification	Yes	All experiments were conducted on a server equipped with an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, 256GB RAM, and NVIDIA GeForce RTX 3090 GPUs running Ubuntu 20.04.
Software Dependencies	Yes	The model was implemented using PyTorch 2.0.1 and trained on Python 3.9.
Experiment Setup	Yes	EAReranker employs a stacked Transformer architecture with 4 layers and an embedding dimension expansion factor of 4, balancing representational capacity with computational efficiency. Training utilized the AdamW optimizer (batch size 256, learning rate 1e-5) for 50 epochs with early stopping.
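
The reported split and training schedule can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' code: the names SetupConfig, split_indices, and stopping_epoch are hypothetical, and the early-stopping patience and the shuffle seed are assumptions, since the excerpts above do not state them. Only the dataset size, the 80/20 split, the layer count, the expansion factor, the batch size, the learning rate, and the 50-epoch limit come from the table.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupConfig:
    # Values reported in the table above.
    num_pairs: int = 1_000_000
    train_fraction: float = 0.8
    num_layers: int = 4
    expansion_factor: int = 4
    batch_size: int = 256
    learning_rate: float = 1e-5
    max_epochs: int = 50
    # Assumptions (not reported in the excerpts): patience and seed.
    patience: int = 3
    seed: int = 0


def split_indices(n, train_fraction, seed):
    """Shuffle indices 0..n-1 and cut them into train and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def stopping_epoch(val_losses, patience, max_epochs):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)
```

Under these figures, the training set holds 800,000 pairs, which at a batch size of 256 works out to exactly 3,125 optimizer steps per epoch if every batch is full. The stopping criterion shown here (validation loss with a fixed patience) is one common choice; the actual criterion used for EAReranker may differ.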