Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking

Authors: Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Results on the TREC-DL and BEIR benchmarks show that our method consistently achieves a superior accuracy-efficiency trade-off and scales better with compute than fixed-computation baselines. These results highlight the effectiveness and generalizability of our method across diverse retrieval tasks and LLM-based reranking models. We evaluate our method on the TREC Deep Learning [16] and BEIR [17] benchmarks using multiple LLM-based rerankers. Our approach consistently improves ranking quality while requiring fewer reranker calls across diverse datasets and reranker models, including both in-domain and out-of-domain settings. These results demonstrate the effectiveness of uncertainty-aware adaptive computation and the broad applicability of our method.
Researcher Affiliation	Academia	Soyoung Yoon Seoul National University EMAIL Gyuwan Kim University of California, Santa Barbara EMAIL Gyu-Hwung Cho Seoul National University EMAIL Seung-won Hwang Seoul National University EMAIL
Pseudocode	Yes	Algorithm 1 outlines AcuRank, a listwise reranking method that performs adaptive computation guided by probabilistic relevance modeling and uncertainty estimation. At each iteration, AcuRank identifies documents with uncertain rankings and focuses reranking efforts on them, updating their relevance estimates based on listwise reranker outputs. Algorithm 1 AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
Open Source Code	Yes	https://github.com/soyoung97/AcuRank. Justification: Public datasets (TREC-DL, BEIR) are used, and a public code repository is also linked, with the GitHub URL https://github.com/soyoung97/AcuRank.
Open Datasets	Yes	We evaluate reranking performance on two widely used retrieval benchmarks. For TREC Deep Learning (TREC-DL) [16], we use six standard tracks: DL19 (43), DL20 (54), DL21 (53), DL22 (76), DL23 (82), and DL-Hard (50) [40]. For BEIR [17], following Sun et al. [11], we select eight representative datasets: TREC-COVID (50), NFCorpus (323), Signal-1M (97), News (57), Robust04 (249), Touché (49), DBPedia (400), and SciFact (300). The numbers in parentheses indicate the number of queries in each dataset. Justification: Public datasets (TREC-DL, BEIR) are used, and a public code repository is also linked, with the GitHub URL https://github.com/soyoung97/AcuRank.
Dataset Splits	Yes	For TREC Deep Learning (TREC-DL) [16], we use six standard tracks: DL19 (43), DL20 (54), DL21 (53), DL22 (76), DL23 (82), and DL-Hard (50) [40]. For BEIR [17], following Sun et al. [11], we select eight representative datasets: TREC-COVID (50), NFCorpus (323), Signal-1M (97), News (57), Robust04 (249), Touché (49), DBPedia (400), and SciFact (300). The numbers in parentheses indicate the number of queries in each dataset.
Hardware Specification	Yes	All experiments were conducted on a single NVIDIA A6000 GPU (48GB VRAM). Compute nodes: Experiments were conducted on either (i) an ASUS ESC8000-E11 server equipped with dual 4th-Gen Intel Xeon processors, 64 CPU threads, 1.1 TB RAM, and eight NVIDIA A6000 GPUs (48 GB each), or (ii) a workstation with eight RTX 3090 GPUs (24 GB each).
Software Dependencies	Yes	All LLM inference was performed using the transformers library, without acceleration backends such as vLLM or FastChat. Greedy decoding was used throughout, with a fixed random seed to ensure reproducibility. We employ the Python trueskill library. The relevance of each candidate passage is represented as a trueskill.Rating class. After the listwise reranker outputs an ordering, we update all ratings by calling trueskill.rate function with the argument rating_groups= list of ratings, and ranks=[0,1,...,n], where n is the number of passages. We rely on the open-source Python package trueskill (v0.4.5), distributed under the permissive BSD 3-Clause License.
Experiment Setup	Yes	The hyperparameters were selected based on empirical evaluation on a subset of TREC-DL19 and DL20, then applied consistently across all other datasets to ensure a fair comparison. Initialization: We initialize TrueSkill scores based on first-stage retrieval scores. Specifically, we set the mean µi to the raw retrieval score (e.g., BM25 or SPLADE), and the standard deviation to σi = µi/3. Uncertainty threshold: We select documents whose rank probability si = P(xi > t(k)) falls within the range (ϵ, 1 − ϵ). We use ϵ = 0.01 and k = 10 in our default configuration, unless noted otherwise in variant-specific settings. Partitioning strategy: When the number of uncertain documents exceeds the reranker capacity m, we divide them into equally sized groups using sequential partitioning. Otherwise, we rerank using a single batch. For ablation, we also evaluate a random grouping variant. Stopping criterion: We terminate reranking when the number of uncertain documents falls below τ = 10, or the reranker call budget is exhausted. Both rerankers operate over candidate lists of size m = 20, formatted as prompts containing the query and document list. All inputs were truncated to a maximum of 4,096 tokens. Greedy decoding was used throughout, with a fixed random seed to ensure reproducibility.
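
Illustrative sketch (not from the paper or its repository): the Experiment Setup row describes initialization, uncertainty selection, sequential partitioning, and a stopping rule. The Python below is one possible reading of those four steps using only the standard library. It assumes each relevance score is an independent Gaussian and that the top-k threshold t(k) is supplied by the caller as a fixed number; how the paper actually computes t(k) is not stated in the quoted text. All function names and the `budget` and `calls_used` parameters are hypothetical.

```python
import math


def init_ratings(scores):
    """Initialize (mu, sigma) per document: mu = retrieval score, sigma = mu / 3."""
    return [(s, s / 3.0) for s in scores]


def prob_above(mu, sigma, t):
    """P(x > t) for x ~ N(mu, sigma^2), computed with the normal CDF.

    Assumption: t is treated as a fixed scalar threshold.
    """
    if sigma <= 0.0:
        # Degenerate case: no uncertainty left, so the outcome is deterministic.
        return 1.0 if mu > t else 0.0
    return 0.5 * (1.0 + math.erf((mu - t) / (sigma * math.sqrt(2.0))))


def uncertain_indices(ratings, t, eps=0.01):
    """Indices whose top-k probability lies strictly inside (eps, 1 - eps)."""
    return [
        i
        for i, (mu, sigma) in enumerate(ratings)
        if eps < prob_above(mu, sigma, t) < 1.0 - eps
    ]


def sequential_partition(indices, m=20):
    """Split indices, in order, into near-equal groups of at most m items."""
    if not indices:
        return []
    if len(indices) <= m:
        return [list(indices)]  # fits in a single reranker call
    n_groups = math.ceil(len(indices) / m)
    size = math.ceil(len(indices) / n_groups)
    return [list(indices[i:i + size]) for i in range(0, len(indices), size)]


def should_stop(n_uncertain, calls_used, budget, tau=10):
    """Stop when fewer than tau documents are uncertain or the budget is spent."""
    return n_uncertain < tau or calls_used >= budget


if __name__ == "__main__":
    ratings = init_ratings([90.0, 12.0, 11.0, 3.0])
    uncertain = uncertain_indices(ratings, t=11.5)
    print(uncertain, sequential_partition(uncertain))
```

In a full loop, each group returned by `sequential_partition` would be sent to the listwise reranker, and the resulting order would update the ratings (the paper does this with `trueskill.rate`) before the uncertain set is recomputed.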