Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Monte Carlo Tree Search and Reinforcement Learning

Authors: Tom Vodopivec, Spyridon Samothrakis, Branko Ster

JAIR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our main experimental goal is to discover whether swapping Monte Carlo (MC) backups with temporal-difference (TD) backups increases the planning performance in MCTS-like algorithms. Therefore, we evaluate whether an eligibility trace decay rate λ < 1 performs better than λ = 1 when temporal-difference tree search (TDTS) algorithms are configured similarly to MCTS. We also observe the optimal value of λ under different settings and the performance-sensitivity of the new parameters of TDTS methods. First we experiment with several TDTS algorithms on single-player toy games, which allow us to evaluate a large number of configurations due to the low computation time. Then we proceed with the evaluation of Sarsa-UCT(λ) (Algorithm 1) on classic two-player games, where it learns from self-play, and on real-time arcade video games, where little computation time is available per move.
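The paper's central experimental question, swapping Monte Carlo backups for temporal-difference backups with decay rate λ, can be illustrated with a minimal sketch. The function names, the toy episode, and the accumulating-trace variant below are illustrative assumptions, not taken from the paper's Algorithm 1:

```python
# Hedged sketch: contrast an every-visit Monte Carlo backup with an
# offline TD(lambda) backup (accumulating eligibility traces) over one
# finished episode of states and per-step rewards. With lam = 1 and a
# 1/visit-count step size, TD(lambda) reduces to the MC backup, which is
# the boundary case the paper compares against (lambda < 1 vs lambda = 1).

def mc_backup(values, counts, episode, rewards, gamma=1.0):
    """Every-visit MC: back up the full discounted return to each state."""
    G = 0.0
    for state, r in zip(reversed(episode), reversed(rewards)):
        G = r + gamma * G
        counts[state] = counts.get(state, 0) + 1
        # Incremental mean of observed returns.
        values[state] = values.get(state, 0.0) + (G - values.get(state, 0.0)) / counts[state]

def td_lambda_backup(values, episode, rewards, alpha=0.1, gamma=1.0, lam=0.8):
    """TD(lambda): bootstrap from successor values, spread the TD error
    backwards through eligibility traces decayed by gamma * lam."""
    traces = {}
    for i, state in enumerate(episode):
        next_v = values.get(episode[i + 1], 0.0) if i + 1 < len(episode) else 0.0
        delta = rewards[i] + gamma * next_v - values.get(state, 0.0)
        traces[state] = traces.get(state, 0.0) + 1.0  # accumulating trace
        for s in traces:
            values[s] = values.get(s, 0.0) + alpha * delta * traces[s]
            traces[s] *= gamma * lam
```

On a three-step episode ending in reward 1, the MC backup assigns the full return to every visited state, while TD(λ) with λ < 1 credits earlier states by a geometrically decayed amount.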
Researcher Affiliation Academia Tom Vodopivec EMAIL Faculty of Computer and Information Science University of Ljubljana Večna pot 113, Ljubljana, Slovenia Spyridon Samothrakis EMAIL Institute of Data Science and Analytics University of Essex Wivenhoe Park, Colchester CO4 3SQ, Essex, U.K. Branko Šter EMAIL Faculty of Computer and Information Science University of Ljubljana Večna pot 113, Ljubljana, Slovenia
Pseudocode Yes Algorithm 1 An iteration of a tabular offline Sarsa-UCT(λ) algorithm that builds a tree and evaluates actions with afterstates.
Open Source Code Yes The detailed results and the source code of both algorithms are available on the GVG-AI web site (www.gvgai.net).
Open Datasets Yes The general video game AI (GVG-AI) competition (Perez et al., 2015) challenges AI players on single-player and two-player arcade games that derive from early computer video games (e.g., Pacman, Sokoban, The legend of Zelda, etc.).
Dataset Splits No We experimentally measure the win rate and score of our player on the GVG-AI single-player training sets 1, 2, and 3 (each set consists of 10 games). With each value of λ we compute at least 50 repeats for each of the 5 levels of each of the 30 games.
Hardware Specification No No specific hardware details are mentioned in the paper for running the experiments. It only mentions general computing contexts like 'real-time arcade video games' and 'little computation time is available per move'.
Software Dependencies No No specific software dependencies with version numbers are mentioned in the paper.
Experiment Setup Yes The main control parameters are the eligibility trace decay rate λ ∈ [0, 1] and the available computational time per search, which is expressed as the number of either simulated time steps or simulated episodes (MCTS iterations) during planning. Other configuration details of the algorithms: the update step-size α either decreases inversely with the number of episodes that visited a state or is held constant at 0.01, 0.05, 0.10, or 0.20; the initialization values Vinit are set constant to −5, 0, 0.5, 1, 5, 10, or 20 (chosen according to the length of the optimal solution of the Shortest walk game); the assumed playout values Qplayout equal Qinit or 0; the ε-greedy policy exploration rate ε is constant at 0.1 or linearly decreases from 0.1 towards 0; in each episode (iteration), the algorithm memorizes either all newly-visited nodes or 1, 5, or 10 newly-visited nodes, in order of visit. The UCB1 exploration rate Cp is fixed at 1. The UCB exploration bias is computed according to Algorithm 1 (line 44).
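The UCB1 exploration bias with the fixed constant Cp = 1 can be sketched as follows. This uses one common UCT parameterization of the bias term; the paper's Algorithm 1 (line 44) additionally normalizes value estimates before applying the bias, which is omitted here, and the function names are illustrative:

```python
import math

def ucb1_score(mean_value, child_visits, parent_visits, cp=1.0):
    """UCB1 score: value estimate plus exploration bias with constant Cp
    (the paper fixes Cp = 1). Unvisited actions get infinite priority."""
    if child_visits == 0:
        return math.inf
    return mean_value + 2 * cp * math.sqrt(math.log(parent_visits) / child_visits)

def select_action(children, parent_visits, cp=1.0):
    """Pick the UCB1-maximizing action; children maps
    action -> (mean_value, visit_count)."""
    return max(children, key=lambda a: ucb1_score(*children[a], parent_visits, cp))
```

Rarely tried actions receive a large bias, so the tree policy keeps revisiting them until their value estimates justify exploitation.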