Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator

Authors: Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results across various settings demonstrate that SILENCER can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability.
Researcher Affiliation	Collaboration	1 School of Computer Science, Beijing Institute of Technology 2 Xiaohongshu Inc
Pseudocode	Yes	Algorithm 1 Bias-Neutralizing Ensemble Algorithm.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release our code soon in the github.
Open Datasets	Yes	We select three tasks to evaluate the cross-task effectiveness of SILENCER, with each task paired with a high-quality human-annotated benchmark for comparison: math reasoning (MATH (Hendrycks et al., 2021)), language understanding (MMLU-Pro (Wang et al., 2024)), and commonsense reasoning (HellaSwag (Zellers et al., 2019)).
Dataset Splits	No	Details. Since multiple settings are involved, we default the benchmark size N for each sub-task within a task (e.g., MATH-Algebra) to 50. We explore the impact of larger sizes in Section 5.4. To ensure fairness in the comparison, we keep the benchmark size consistent before and after ensembling.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: We primarily use API calls to access the model generator, and the invocation rate depends on the service provider.
Software Dependencies	No	The paper makes no explicit mention of specific software dependencies with version numbers used for its implementation beyond accessing LLM APIs.
Experiment Setup	Yes	The sampling temperature for the LLMs is set to 1. In addition to self-bias, we also compute the Pearson correlation r_p of reference models' performance between model-generated and high-quality human-annotated benchmarks to assess the evaluation effectiveness of the generated benchmarks.
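
The Experiment Setup entry above refers to the Pearson correlation r_p between reference models' performance on a model-generated benchmark and on a human-annotated one. A minimal sketch of that computation follows, using only the Python standard library. The function name `pearson_r` and all score values are hypothetical and chosen for illustration; they are not taken from the paper, and this is not the authors' implementation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of five reference models (illustrative values only).
human_benchmark_scores = [0.82, 0.74, 0.61, 0.55, 0.43]
generated_benchmark_scores = [0.88, 0.79, 0.70, 0.58, 0.52]

r_p = pearson_r(generated_benchmark_scores, human_benchmark_scores)
print(f"r_p = {r_p:.3f}")
```

Each list holds one score per reference model, in the same model order. A value of r_p close to 1 indicates that the generated benchmark ranks and spaces the reference models much as the human-annotated benchmark does.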