Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On Monte Carlo Tree Search and Reinforcement Learning

Authors: Tom Vodopivec, Spyridon Samothrakis, Branko Ster

JAIR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our main experimental goal is to discover whether swapping Monte Carlo (MC) backups with temporal-difference (TD) backups increases the planning performance in MCTS-like algorithms. Therefore, we evaluate whether an eligibility trace decay rate λ < 1 performs better than λ = 1 when temporal-difference tree search (TDTS) algorithms are configured similarly to MCTS. We also observe the optimal value of λ under different settings and the performance-sensitivity of the new parameters of TDTS methods. First we experiment with several TDTS algorithms on single-player toy games, which allow us to evaluate a large number of configurations due to the low computation time. Then we proceed with the evaluation of Sarsa-UCT(λ) (Algorithm 1) on classic two-player games, where it learns from self-play, and on real-time arcade video games, where little computation time is available per move.
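The paper's central experimental question, swapping Monte Carlo backups for temporal-difference backups with decay rate λ, can be illustrated with a minimal sketch. The function names, the toy episode, and the accumulating-trace variant below are illustrative assumptions, not taken from the paper's Algorithm 1:

```python
# Hedged sketch: contrast an every-visit Monte Carlo backup with an
# offline TD(lambda) backup (accumulating eligibility traces) over one
# finished episode of states and per-step rewards. With lam = 1 and a
# 1/visit-count step size, TD(lambda) reduces to the MC backup, which is
# the boundary case the paper compares against (lambda < 1 vs lambda = 1).

def mc_backup(values, counts, episode, rewards, gamma=1.0):
    """Every-visit MC: back up the full discounted return to each state."""
    G = 0.0
    for state, r in zip(reversed(episode), reversed(rewards)):
        G = r + gamma * G
        counts[state] = counts.get(state, 0) + 1
        # Incremental mean of observed returns.
        values[state] = values.get(state, 0.0) + (G - values.get(state, 0.0)) / counts[state]

def td_lambda_backup(values, episode, rewards, alpha=0.1, gamma=1.0, lam=0.8):
    """TD(lambda): bootstrap from successor values, spread the TD error
    backwards through eligibility traces decayed by gamma * lam."""
    traces = {}
    for i, state in enumerate(episode):
        next_v = values.get(episode[i + 1], 0.0) if i + 1 < len(episode) else 0.0
        delta = rewards[i] + gamma * next_v - values.get(state, 0.0)
        traces[state] = traces.get(state, 0.0) + 1.0  # accumulating trace
        for s in traces:
            values[s] = values.get(s, 0.0) + alpha * delta * traces[s]
            traces[s] *= gamma * lam
```

On a three-step episode ending in reward 1, the MC backup assigns the full return to every visited state, while TD(λ) with λ < 1 credits earlier states by a geometrically decayed amount.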
Researcher Affiliation Academia Tom Vodopivec EMAIL Faculty of Computer and Information Science University of Ljubljana Večna pot 113, Ljubljana, Slovenia Spyridon Samothrakis EMAIL Institute of Data Science and Analytics University of Essex Wivenhoe Park, Colchester CO4 3SQ, Essex, U.K. Branko Šter EMAIL Faculty of Computer and Information Science University of Ljubljana Večna pot 113, Ljubljana, Slovenia
Pseudocode Yes Algorithm 1 An iteration of a tabular offline Sarsa-UCT(λ) algorithm that builds a tree and evaluates actions with afterstates.
Open Source Code Yes The detailed results and the source code of both algorithms are available on the GVG-AI web site (www.gvgai.net).
Open Datasets Yes The general video game AI (GVG-AI) competition (Perez et al., 2015) challenges AI players on single-player and two-player arcade games that derive from early computer video games (e.g., Pacman, Sokoban, The legend of Zelda, etc.).
Dataset Splits No We experimentally measure the win rate and score of our player on the GVG-AI single-player training sets 1, 2, and 3 (each set consists of 10 games). With each value of λ we compute at least 50 repeats for each of the 5 levels of each of the 30 games.
Hardware Specification No No specific hardware details are mentioned in the paper for running the experiments. It only mentions general computing contexts like 'real-time arcade video games' and 'little computation time is available per move'.
Software Dependencies No No specific software dependencies with version numbers are mentioned in the paper.
Experiment Setup Yes The main control parameters are the eligibility trace decay rate λ ∈ [0, 1] and the available computational time per search, which is expressed as the number of either simulated time steps or simulated episodes (MCTS iterations) during planning. Other configuration details of the algorithms: the update step-size α either decreases inversely with the number of episodes that visited a state or is held constant at 0.01, 0.05, 0.10, or 0.20; the initialization values Vinit are set constant to −5, 0, 0.5, 1, 5, 10, or 20 (chosen according to the length of the optimal solution of the Shortest walk game); the assumed playout values Qplayout equal Qinit or 0; the ε-greedy policy exploration rate ε is constant at 0.1 or linearly decreases from 0.1 towards 0; in each episode (iteration), the algorithm memorizes either all newly-visited nodes or 1, 5, or 10 newly-visited nodes, in order of visit. The UCB1 exploration rate Cp is fixed at 1. The UCB exploration bias is computed according to Algorithm 1 (line 44).
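The UCB1 exploration bias with the fixed constant Cp = 1 can be sketched as follows. This uses one common UCT parameterization of the bias term; the paper's Algorithm 1 (line 44) additionally normalizes value estimates before applying the bias, which is omitted here, and the function names are illustrative:

```python
import math

def ucb1_score(mean_value, child_visits, parent_visits, cp=1.0):
    """UCB1 score: value estimate plus exploration bias with constant Cp
    (the paper fixes Cp = 1). Unvisited actions get infinite priority."""
    if child_visits == 0:
        return math.inf
    return mean_value + 2 * cp * math.sqrt(math.log(parent_visits) / child_visits)

def select_action(children, parent_visits, cp=1.0):
    """Pick the UCB1-maximizing action; children maps
    action -> (mean_value, visit_count)."""
    return max(children, key=lambda a: ucb1_score(*children[a], parent_visits, cp))
```

Rarely tried actions receive a large bias, so the tree policy keeps revisiting them until their value estimates justify exploitation.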