Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, Kai Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Table 1 presents our primary experimental results. First, both our 7B and 32B model achieve Go-specific performance that significantly surpasses all existing general LLMs. On KataGo-Bench-1K, the strongest general model apart from LoGos is Claude3.7-Sonnet, which achieves a prediction accuracy of 34.3%. In contrast, our models achieve nearly 2.6 times the accuracy of Claude3.7-Sonnet and even exceed the performance of KataGo-HumanSL-9d (88.6% and 87.8%), indicating that our models attain proficiency in Go comparable to professional players.
Researcher Affiliation	Collaboration	Yichuan Ma1,2, Linyang Li1, Yongkang Chen1, Peiji Li1,2, Jiasheng Ye2, Qipeng Guo1, Dahua Lin1, Kai Chen1; 1Shanghai AI Laboratory, Shanghai; 2School of Computer Science, Fudan University, Shanghai
Pseudocode	No	No explicit pseudocode or algorithm block found. The paper describes the methodology in prose and mathematical formulations. Figure 4 shows a 'heuristic template' which describes steps for data construction, but not a general algorithm for the model itself.
Open Source Code	No	We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: https://github.com/Entarochuan/LoGos. Additionally, the NeurIPS Paper Checklist for 'open access to data and code' states: 'Answer: [No] Justification: We will release the models, datasets and the evaluation benchmark later.'
Open Datasets	Yes	For the selection of long CoT reasoning data, we collect several distilled reasoning datasets covering a wide range of general tasks including code, mathematics, and general reasoning. Specifically, our collected datasets include Openthoughts-114K [Team, 2025], NuminaMath-QwQCoT-5M [Team et al., 2025], OpenCodeReasoning [Ahmad et al., 2025], Bespoke-Stratos-17k [Labs, 2025], and AM-DeepSeek-R1-Distilled-1.4M [Zhao et al., 2025].
Dataset Splits	No	Next Step Prediction Dataset: We collect a dataset containing over 5 million game records played by both top amateur and professional Go players. From these game records, we uniformly sample over 10 million game states and annotate them using the open-source Go engine KataGo [Wu, 2019]. Commentary Dataset: We collect and process 100K Go commentary cases from open resources, each containing an independent game state and the corresponding comment. Benchmarks: We propose KataGo-Bench-1K, our original benchmark for measuring LLMs' Go capability. KataGo-Bench-1K is a test set of 1,000 samples from KataGo annotation data, with game states sampled across various player skill levels. The paper specifies dataset sizes and a test set, but does not provide explicit training/validation splits for the main Go dataset or details on how the mixed datasets are partitioned for training and validation.
Hardware Specification	Yes	For training the 7B model, we utilized 32 A800 (80GB) GPUs, while the 32B model required 64 A800 GPUs. The 7B model training utilizes 32 A800 GPUs, while the 32B model reinforcement learning phase requires 64 GPUs.
Software Dependencies	No	Our implementation is primarily based on modifications to the VerL framework [Sheng et al., 2024]. The paper mentions a framework but does not specify its version number or any other software dependencies with version information (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	In the mixed cold start phase, we SFT the base models with a maximum sequence length of 32,768 tokens. We employed a cosine annealing learning rate scheduler with rates ranging from 4e-5 to 4e-6. Regarding specific parameter settings, we configure the training batch size to 64, with 16 roll-outs per data point and a maximum response length of 8,192 tokens. Due to the significant distribution gap between Go task responses and the reference model's pretraining data, we set the KL coefficient (kl_coef) to 5e-4.
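
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The snippet below is an illustration only, not the authors' code: the field names, the `cosine_lr` helper, and the total step count are hypothetical, the schedule is assumed to have no warmup (the quoted text mentions none), and the batch size of 64 is assumed to count prompts rather than individual responses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    max_seq_len: int = 32_768       # SFT maximum sequence length (tokens)
    lr_max: float = 4e-5            # learning rate at the start of the schedule
    lr_min: float = 4e-6            # learning rate at the end of the schedule
    train_batch_size: int = 64      # RL training batch size (assumed: prompts)
    rollouts_per_prompt: int = 16   # roll-outs per data point
    max_response_len: int = 8_192   # maximum response length (tokens)
    kl_coef: float = 5e-4           # KL coefficient


def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float) -> float:
    """Standard cosine annealing from lr_max down to lr_min, without warmup."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule stays at lr_min after the final step.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def responses_per_step(cfg: ReportedSetup) -> int:
    """Responses generated per RL step if the batch size counts prompts."""
    return cfg.train_batch_size * cfg.rollouts_per_prompt


if __name__ == "__main__":
    cfg = ReportedSetup()
    total_steps = 1_000  # hypothetical; the quoted text gives no step count
    for step in (0, 250, 500, 750, 1_000):
        lr = cosine_lr(step, total_steps, cfg.lr_max, cfg.lr_min)
        print(f"step {step:>5}: lr = {lr:.3e}")
    print("responses per RL step:", responses_per_step(cfg))
```

Under these assumptions the schedule starts at 4e-5 and ends at 4e-6, matching the reported range. The total number of training steps and any warmup phase would need to be taken from the paper or its released code.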