Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

From Self-Check to Consensus: Bayesian Strategic Decoding in Large Language Models

Authors: Weitong Zhang, Chengqi Zang, Bernhard Kainz

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first assess efficiency and reliability through a multidimensional comparison with another game-theoretic method [31] and several variants. Then, we evaluate performance on a diverse set of LLMs used for real-world tasks: MMLU [29], ARC-Easy (E.), -Challenge (C.) [18], RACE-High (H.) [37].
Researcher Affiliation | Academia | Weitong Zhang 1, Chengqi Zang 2, Bernhard Kainz 1,4; 1 Imperial College London, UK; 2 University of Tokyo, JP; 4 Friedrich-Alexander University Erlangen-Nürnberg, GER; EMAIL
Pseudocode | Yes | Algorithm 1 BDG: Self-checking Mechanism for LLMs
Open Source Code | No | We provide model-agnostic pseudocode implementation in Algorithm 1 (Appendix Section 1) and detailed instructions for reproducing our results. The benchmarks we used (MMLU, ARC, RACE, etc.) are publicly available datasets, and we used open-source LLaMA models for our experiments.
Open Datasets | Yes | We conducted our evaluations using widely recognized benchmarks such as ARC-Easy, ARC-Challenge, MMLU, and RACE. The experiments were performed using the open-source LLaMA 7B and 13B models.
Dataset Splits | Yes | We conducted our evaluations using widely recognized benchmarks such as ARC-Easy, ARC-Challenge, MMLU, and RACE. The experiments were performed using the open-source LLaMA 7B and 13B models.
Hardware Specification | Yes | All experiments were conducted on NVIDIA A6000 and A100 GPUs, with runtimes ranging from 0.5 to 6 hours depending on the model size, task, and experimental settings.
Software Dependencies | No | The paper mentions using LLaMA models (7B, 13B parameters) and 16-bit inference, but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | Hyperparameters. We set η_D, λ_D and η_G, λ_G with 0.1 compared to ECG. Experiments are run 5000 times with early stopping based on equilibrium convergence. BDG can usually converge by 500 iterations or less. The hyperparameters can be larger according to the tasks and initial model ability.
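
The Experiment Setup entry describes an iteration cap of 5000 with early stopping once the procedure converges. The sketch below illustrates that control flow only; it is not the paper's BDG algorithm. The names `SetupConfig`, `run_until_converged`, and `toy_step`, the tolerance `tol`, and the max-change convergence test are all assumptions made for illustration, and the update rule is a toy stand-in. Only the 0.1 hyperparameter values and the 5000-iteration cap come from the quoted text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SetupConfig:
    # Values quoted in the Experiment Setup entry.
    eta_d: float = 0.1
    lambda_d: float = 0.1
    eta_g: float = 0.1
    lambda_g: float = 0.1
    max_iters: int = 5000
    # Assumption: the source does not state a convergence tolerance.
    tol: float = 1e-6


def run_until_converged(
    step: Callable[[List[float]], List[float]],
    state: List[float],
    max_iters: int,
    tol: float,
) -> Tuple[List[float], int, bool]:
    """Apply `step` repeatedly; stop early once the largest change is below `tol`."""
    for iteration in range(1, max_iters + 1):
        new_state = step(state)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:
            return state, iteration, True
    return state, max_iters, False


def toy_step(state: List[float], target: List[float], eta: float) -> List[float]:
    # Hypothetical stand-in update: move each entry a fraction `eta` toward a target.
    return [x + eta * (t - x) for x, t in zip(state, target)]


cfg = SetupConfig()
target = [0.7, 0.2, 0.1]
final_state, iters_used, converged = run_until_converged(
    lambda s: toy_step(s, target, cfg.eta_d),
    [1 / 3, 1 / 3, 1 / 3],
    cfg.max_iters,
    cfg.tol,
)
print(converged, iters_used)
```

With this toy update, the loop stops long before the cap, which mirrors the entry's remark that convergence usually needs far fewer iterations than the maximum allowed. The other three hyperparameters are carried in the config for completeness and are unused by the toy update.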