Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator
Authors: Peiwen Yuan, Yiwei Li, Shaoxiong Feng, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Prof. Kan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various settings demonstrate that SILENCER can suppress self-bias to near zero, significantly improve evaluation effectiveness of the generated benchmark (with an average improvement from 0.655 to 0.833 in Pearson correlation with high-quality human-annotated benchmark), while also exhibiting strong generalizability. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science, Beijing Institute of Technology 2 Xiaohongshu Inc |
| Pseudocode | Yes | Algorithm 1 Bias-Neutralizing Ensemble Algorithm. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release our code soon in the github. |
| Open Datasets | Yes | We select three tasks to evaluate the cross-task effectiveness of SILENCER, with each task paired with a high-quality human-annotated benchmark for comparison: math reasoning (MATH (Hendrycks et al., 2021)), language understanding (MMLU-Pro (Wang et al., 2024)), and commonsense reasoning (HellaSwag (Zellers et al., 2019)). |
| Dataset Splits | No | Details. Since multiple settings are involved, we default the benchmark size N for each sub-task within a task (e.g., MATH-Algebra) to 50. We explore the impact of larger sizes in Section 5.4. To ensure fairness in the comparison, we keep the benchmark size consistent before and after ensembling. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: We primarily use API calls to access the model generator, and the invocation rate depends on the service provider. |
| Software Dependencies | No | The paper makes no explicit mention of specific software dependencies with version numbers used for its implementation beyond accessing LLM APIs. |
| Experiment Setup | Yes | The sampling temperature for the LLMs is set to 1. In addition to self-bias, we also compute the Pearson correlation r_p of reference models' performance between model-generated and high-quality human-annotated benchmarks to assess the evaluation effectiveness of the generated benchmarks. |
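
The Experiment Setup entry describes measuring evaluation effectiveness as the Pearson correlation between reference models' scores on a model-generated benchmark and on a human-annotated one. A minimal sketch of that computation follows. It is not the authors' code: the function name `pearson` and the score lists are hypothetical, and the numbers are illustrative placeholders rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two entries")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of four reference models (illustrative values only):
# one score per model on the generated benchmark and on the human benchmark.
generated_scores = [0.62, 0.71, 0.55, 0.80]
human_scores = [0.58, 0.69, 0.50, 0.77]

r_p = pearson(generated_scores, human_scores)
print(f"r_p = {r_p:.3f}")
```

A value of r_p close to 1 would indicate that the generated benchmark ranks and spaces the reference models much as the human-annotated benchmark does.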