Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Uncertainty-Guided Exploration for Efficient AlphaZero Training

Authors: Scott Cheng, Meng-Yu Tsai, Ding-Yong Hong, Mahmut T Kandemir

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings indicate that branching with 10 variations per game provides the best performance-exploration balance. Overall, our end-to-end results show an improved sample efficiency over the baseline by 58.5% on 9x9 Go in the early stage of training and by 47.3% on 19x19 Go in the late stage of training.
Researcher Affiliation | Academia | Scott Cheng (1), Meng-Yu Tsai (2), Ding-Yong Hong (3), Mahmut Taylan Kandemir (1). (1) The Pennsylvania State University, USA; (2) Independent; (3) Institute of Information Science, Academia Sinica, Taiwan.
Pseudocode | Yes | Algorithm 1: Uncertainty-Guided Branching
    Input: G, the number of states to collect
    Output: 𝒢 = {(state, policy label, value label)}
    while |𝒢| < G do
        (s, a) ← initial state and action
        for i ← 1 to V do
            {(s', π(s'), z_i) | s' ∈ S_i} ← play episode from (s, a)
            Compute LCR and sampling weights u (defined in Equation 4)
            s ∼ Y(u) (defined in Equation 5)
            a ∼ π(· | s)
            while (s, a) has been played do
                s ← preceding state of s
                a ∼ π(· | s)
        S := ⋃_{i=1}^{V} S_i
        𝒢 ← 𝒢 ∪ {(s', π(s'), (Σ_{i: s' ∈ S_i} z_i) / |{i : s' ∈ S_i}|) | s' ∈ S}
    return 𝒢
Open Source Code | Yes | The model checkpoints are provided in https://huggingface.co/chengscott/ugb_zero.
Open Datasets | Yes | while for 19x19 Go, we use a model pretrained with 100M samples. Our evaluations employ Elo rating [34] against the state-of-the-art KataGo program [7], and the Elo rating is converted to the win rate to reflect relative playing strength. Detailed hyperparameters for self-play and training are provided in the appendix A.3. ... Our pretrained model follows previous work [7] to train on the first 100M samples from the public repository [39] and set a baseline of an Elo rating of 9958.
Dataset Splits | No | while for 19x19 Go, we use a model pretrained with 100M samples. ... We evaluated our proposed method on 9x9 and 19x19 Go. ... Figure 6a shows that V = 10 produces the highest strength and outperforms the V = 1 baseline by a 69% win rate under the same number of training samples, thus achieving the best training efficiency among different variations. ... Figure 7a shows that our method achieves a 62% win rate against the baseline after training on 12M positions in 19x19 Go. -> The paper discusses training samples but not explicit train/test/validation splits in the traditional sense, as data is generated via self-play.
Hardware Specification | Yes | Our experiments are mainly conducted on 2 NVIDIA A100 GPUs with 768 GB system memory. ... Our end-to-end training experiments in Section 5.4 are conducted on a cluster of 5 nodes, each consisting of 8 NVIDIA V100 GPUs and 768 GB system memory.
Software Dependencies | No | Table 1 shows the hyperparameters for self-play and training. In 9x9 Go, the model consists of 6 ResNet residual blocks [37] with 96 channels each, while in 19x19 Go, the model consists of 5 Nested Bottleneck blocks [38] with 192 channels each, following the model architecture in the latest KataGo models [39]. -> The paper does not specify software versions for libraries or frameworks.
Experiment Setup | Yes | Table 1 shows the hyperparameters for self-play and training. In 9x9 Go, the model consists of 6 ResNet residual blocks [37] with 96 channels each, while in 19x19 Go, the model consists of 5 Nested Bottleneck blocks [38] with 192 channels each, following the model architecture in the latest KataGo models [39]. Our pretrained model follows previous work [7] to train on the first 100M samples from the public repository [39] and set a baseline of an Elo rating of 9958. Moreover, the LCR threshold for Equation 4 is LCR0 = 0.48 for both games. (Table 1 lists: MCTS simulation 400; MCTS c_puct 1.25; Dirichlet noise ratio ε 0.25; Dirichlet parameter α 0.12 / 0.03; G states per iteration 0.3M / 1.2M; Optimizer SGD; Batch size 512; Optimizer momentum 0.9; Optimizer weight decay 1e-4; Optimizer learning rate 0.01; Sample factor 2 1 2)
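
The quoted passages report playing strength as win rates converted from Elo ratings. The paper's exact conversion is not reproduced on this page; the sketch below assumes the standard logistic Elo model with a scale of 400, and the function names are illustrative rather than taken from the authors' code.

```python
import math


def elo_to_win_rate(elo_diff: float) -> float:
    """Expected score of a player rated `elo_diff` points above the opponent.

    Assumes the standard logistic Elo model (scale 400); this is an
    assumption, not a formula quoted from the paper.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def win_rate_to_elo(win_rate: float) -> float:
    """Inverse of elo_to_win_rate; defined only for 0 < win_rate < 1."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must lie strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


if __name__ == "__main__":
    # Under this model, a 69% win rate corresponds to roughly 139 Elo.
    print(round(win_rate_to_elo(0.69)))
    print(elo_to_win_rate(0.0))  # equal ratings give 0.5
```

Under this assumed model, the 69% and 62% win rates quoted above would map to rating gaps of roughly 139 and 85 Elo, respectively.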
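
The final update in Algorithm 1 assigns each collected state a value label equal to the mean outcome z_i over the variations that visited it. A minimal sketch of that averaging step follows. The data layout (each variation as a list of `(state, policy, z)` tuples), the name `merge_variations`, and the choice to keep the first policy label seen for a state are all assumptions made for illustration; outcomes are assumed to be expressed from a single consistent perspective.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Sample = Tuple[Hashable, Tuple[float, ...], float]  # (state, policy label, outcome z_i)


def merge_variations(variations: List[List[Sample]]) -> List[Sample]:
    """Average value labels of states shared across variations.

    For every state s' in the union of all variations, the value label is
    the sum of z_i over variations i containing s', divided by the number
    of such variations.
    """
    policy: Dict[Hashable, Tuple[float, ...]] = {}
    z_sum: Dict[Hashable, float] = defaultdict(float)
    count: Dict[Hashable, int] = defaultdict(int)

    for samples in variations:
        seen = set()  # count each variation at most once per state
        for state, pi, z in samples:
            if state in seen:
                continue
            seen.add(state)
            policy.setdefault(state, pi)  # assumption: keep first policy label
            z_sum[state] += z
            count[state] += 1

    return [(s, policy[s], z_sum[s] / count[s]) for s in policy]


if __name__ == "__main__":
    variations = [
        [("root", (0.5, 0.5), 1.0), ("a", (1.0, 0.0), 1.0)],
        [("root", (0.5, 0.5), -1.0), ("b", (0.0, 1.0), -1.0)],
    ]
    for row in merge_variations(variations):
        print(row)
```

In this toy input the shared state `"root"` appears in one winning and one losing variation, so its averaged value label is 0.0, while the states unique to one variation keep that variation's outcome.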