An Analysis of Monte Carlo Tree Search

Authors: Steven James, George Konidaris, Benjamin Rosman

AAAI 2017

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: We present experimental evidence suggesting that, under certain smoothness conditions, uniformly random simulation policies preserve the ordering over action preferences. This explains the success of MCTS despite its common use of these rollouts to evaluate states. We further analyse non-uniformly random rollout policies and describe conditions under which they offer improved performance.

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: University of the Witwatersrand, Johannesburg, South Africa; Brown University, Providence, RI 02912, USA; Council for Scientific and Industrial Research, Pretoria, South Africa.

Reproducibility Variable: Pseudocode
Result: No
LLM Response: The paper describes the phases of Monte Carlo Tree Search (Figure 1) and details the UCB1 policy, but does not include a formal pseudocode block or algorithm listing.
A minimal sketch of the UCB1 selection rule is included after this listing.

Reproducibility Variable: Open Source Code
Result: No
LLM Response: No explicit statement or link regarding the release of open-source code for the described methodology was found.

Reproducibility Variable: Open Datasets
Result: No
LLM Response: The paper describes generic environments like 'finite Markov Decision Process (MDP)', 'chain walk', and 'grid world', and details how data like 'random transition functions' were generated. However, it does not provide concrete access information (link, DOI, repository, or formal citation) to a specific publicly available dataset used for training.

Reproducibility Variable: Dataset Splits
Result: No
LLM Response: The paper describes experiments run on generated environments (e.g., '10x10 grid', 'ternary tree', 'k-ary tree') and specifies parameters like '5000 iterations per move' or '50 000 iterations'. However, it does not explicitly provide training, validation, or test dataset splits (e.g., percentages or sample counts) for a pre-existing dataset.

Reproducibility Variable: Hardware Specification
Result: No
LLM Response: No specific hardware details, such as GPU models, CPU types, or memory configurations used for running the experiments, are mentioned in the paper.

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or solvers).

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: At each state s, we store the visitation count n_s and average return X_s. For a given node s, the policy then selects the child i that maximises the upper confidence bound X_i + 2C_p √(2 ln n_s / n_i), where C_p is an exploration parameter. Once a leaf node is reached, the expansion phase adds a new node to the tree. A simulation is then run from this node according to the rollout policy, with the outcome being backpropagated up through the tree, updating the nodes' average scores and visitation counts. This cycle of selection, expansion, simulation and backpropagation is repeated until some halting criterion is met, at which point the best action is selected. There are numerous strategies for the final action selection, such as choosing the action that leads to the most visited state (robust child), or the one that leads to the highest-valued state (max child). We adopt the robust child strategy throughout this paper, noting that previous results have shown little difference between the two (Chaslot et al. 2008).

Beginning with a uniformly random policy, we perform one step of policy iteration (γ = 0.95) on a 10 × 10 grid. The resulting policy can then be used to measure how accurately the initial policy evaluated states. As we are only interested in selecting the correct action (and not, say, the values of states), we judge its correctness by the percentage of states in which the policy suggests an optimal action. The results of this experiment are illustrated by Figure 2, and demonstrate that as the bound on the Lipschitz condition increases, the uniformly random policy becomes less and less effective in suggesting the correct action.

We execute UCT on different instantiations of the domain, allowing 5000 iterations per move. However, instead of simulating actions until a terminal state is reached, our rollout phase directly samples the reward from a Gaussian distribution centred about the true value with varying standard deviation σ. The search spends time at the suboptimal maxima before switching to the region x < 0.5. After a sufficient number of simulations, however, UCT does indeed start to visit the optimal region of the graph (Figure 5).

We limit the MCTS algorithm to 30 iterations per move to simulate an environment in which the state-space is far greater than what would be computable given the available resources. Both the mean and standard deviation are incrementally varied from 0 to 4, and are used to parameterise a UCT agent. The agent is then tested on 10 000 different instances of the tree.

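For reference against the Pseudocode item above, the UCB1 child-selection step that UCT relies on can be written as a short function. This is a minimal sketch of the textbook rule rather than code from the paper; the dictionary node representation (fields 'mean' and 'visits') and the default exploration constant C_p = 1/√2 are assumptions.

    import math

    def ucb1_select(children, n_parent, cp=1 / math.sqrt(2)):
        """Pick the child maximising the UCB1 upper confidence bound.

        Each child is a dict with 'mean' (average return X_i) and
        'visits' (visitation count n_i); n_parent is the visitation
        count n_s of the parent node, and cp is the exploration
        constant C_p.
        """
        def bound(child):
            if child['visits'] == 0:
                return float('inf')          # try unvisited children first
            exploit = child['mean']
            explore = 2 * cp * math.sqrt(2 * math.log(n_parent) / child['visits'])
            return exploit + explore

        return max(children, key=bound)

The first term favours children with high average return, while the second is an exploration bonus that shrinks as a child accumulates visits.
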
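The first experiment quoted under Experiment Setup (one step of policy iteration from a uniformly random policy with γ = 0.95, judged by the percentage of states at which the resulting policy suggests an optimal action) could be reproduced along the following lines. This is a sketch under assumed details: a generic tabular MDP given by arrays P and R stands in for the paper's 10 × 10 grid, and value iteration supplies the reference optimal actions; none of it is the authors' code.

    import numpy as np

    def evaluate_policy(P, R, policy, gamma=0.95):
        """Exact evaluation of a stochastic policy on a tabular MDP.

        P is (A, S, S) with P[a, s, s'] = Pr(s' | s, a); R is (S, A) expected
        rewards; policy is (S, A) action probabilities.  Returns V of shape (S,).
        """
        S = R.shape[0]
        P_pi = np.einsum('sa,asp->sp', policy, P)   # state transition matrix under pi
        R_pi = (policy * R).sum(axis=1)             # expected one-step reward under pi
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    def greedy_actions(P, R, V, gamma=0.95):
        """Return Q-values and the greedy action for each state, given V."""
        Q = R + gamma * np.einsum('asp,p->sa', P, V)
        return Q, Q.argmax(axis=1)

    def optimal_action_sets(P, R, gamma=0.95, iters=2000, tol=1e-8):
        """Value iteration, then the set of (near-)optimal actions per state."""
        S, A = R.shape
        V = np.zeros(S)
        for _ in range(iters):
            V = (R + gamma * np.einsum('asp,p->sa', P, V)).max(axis=1)
        Q, _ = greedy_actions(P, R, V, gamma)
        return [set(np.flatnonzero(Q[s] >= Q[s].max() - tol)) for s in range(S)]

    def one_step_pi_accuracy(P, R, gamma=0.95):
        """Percentage of states where one step of policy iteration, started
        from the uniformly random policy, already suggests an optimal action."""
        S, A = R.shape
        uniform = np.full((S, A), 1.0 / A)
        V_uniform = evaluate_policy(P, R, uniform, gamma)
        _, greedy = greedy_actions(P, R, V_uniform, gamma)
        optimal = optimal_action_sets(P, R, gamma)
        return 100.0 * np.mean([greedy[s] in optimal[s] for s in range(S)])

    if __name__ == '__main__':
        rng = np.random.default_rng(0)
        S, A = 100, 4                               # e.g. a 10 x 10 grid with 4 actions
        P = rng.random((A, S, S))
        P /= P.sum(axis=2, keepdims=True)           # normalise into stochastic matrices
        R = rng.random((S, A))
        print(f'{one_step_pi_accuracy(P, R):.1f}% of states get an optimal action')
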
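The later experiment replaces the simulation phase with a direct draw from a Gaussian centred on the true value, with standard deviation σ; combined with the backpropagation update described in the quoted setup, it might look like the sketch below. The function names, the true-value argument and the dictionary nodes (shared with the UCB1 sketch) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng()

    def noisy_rollout(true_value, sigma):
        """Stand-in for the simulation phase: rather than playing out a full
        episode under a rollout policy, sample the return directly from a
        Gaussian centred on the leaf's true value with standard deviation
        sigma.  Larger sigma means noisier leaf evaluations for UCT."""
        return rng.normal(loc=true_value, scale=sigma)

    def backpropagate(path, value):
        """Standard MCTS backup: push the sampled return up the visited path,
        updating each node's visitation count and running average return."""
        for node in path:
            node['visits'] += 1
            node['mean'] += (value - node['mean']) / node['visits']

Together with ucb1_select above, these pieces cover the selection, simulation and backpropagation phases quoted in the experiment setup.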