Monte Carlo Tree Search for Policy Optimization
Authors: Xiaobai Ma, Katherine Driggs-Campbell, Zongzhang Zhang, Mykel J. Kochenderfer
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate improved performance on reinforcement learning tasks with deceptive or sparse reward functions compared to popular gradient-based and deep genetic algorithm baselines. We compare the performance of MCTSPO to two state-of-the-art baselines: TRPO as a representative of gradient-based methods [Schulman et al., 2015a] and Deep GA using safe mutation [Lehman et al., 2018]. The results and the corresponding discussion are presented in Section 5. |
| Researcher Affiliation | Academia | (1) Aeronautics and Astronautics Department, Stanford University; (2) Electrical and Computer Engineering Department, University of Illinois Urbana-Champaign; (3) National Key Laboratory for Novel Software Technology, Nanjing University. Emails: maxiaoba@stanford.edu, krdc@illinois.edu, zhangzongzhang@gmail.com, mykel@stanford.edu |
| Pseudocode | Yes | Algorithm 1, MCTS for Policy Optimization (MCTSPO): function MCTSPO(Task environment Γ, Initial state s0) ...; Algorithm 2, Rollout: function ROLLOUT(Γ, s) ...; Algorithm 3, Get candidate actions: function GETCA(s, τs) ... (An illustrative sketch of this search loop appears after the table.) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is open-source or provide a link to a code repository. |
| Open Datasets | Yes | Three classic continuous control tasks, Acrobot [Geramifard et al., 2015], Mountain Car [Moore, 1991], and Bipedal Walker [Brockman et al., 2016], are tested... we adapt three robotics environments, Ant [Schulman et al., 2015b], Half Cheetah [Wawrzynski, 2007], and Hopper [Murthy and Raibert, 1984], from Open AI Roboschool [Schulman et al., 2017]. |
| Dataset Splits | No | The paper mentions 'train' for the algorithms and 'test' for evaluation but does not specify a separate 'validation' split or exact percentages for any splits within the main text. |
| Hardware Specification | No | The paper mentions that 'The training clock time for Deep GA and MCTSPO is approximately twice that of TRPO. This difference is mainly caused by the single-threaded sampling in our Deep GA and MCTSPO implementation.' (footnote 3), implying computational resources were used, but no specific hardware details (like CPU/GPU models, RAM) are provided for the experiments. |
| Software Dependencies | No | The paper mentions RLLab for TRPO architecture and Open AI Roboschool for environments, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For TRPO, we use the Gaussian multilayer perceptron architecture from RLLab [Duan et al., 2016] with hidden layer sizes of 128, 64, and 32 with tanh activations. It is trained for 5000 iterations using step sizes 0.1 and 1.0. The batch size is set to 1000 for classic control tasks and 5000 for Roboschool. For Deep GA, we use the deterministic multilayer perceptron architecture with the same network structure as used in TRPO. The population sizes are 100, 500, and 1000 with 500, 100, and 50 training iterations, respectively. The truncation size for parent selection is 20. At each iteration, the top three individuals persist to the next generation with no mutation, following a technique called elitism [Such et al., 2017]. The divergence constraint for the mutation step is set to 1.0 through preliminary tests. For MCTSPO, we use the same architecture and batch size as in Deep GA. We use an exploration constant of 2 for classic control tasks and 10 for Roboschool. The progressive widening parameters are set to α = k = 0.3, 0.5, and 0.8, respectively. The number of candidate actions is set to nca = 4 to balance the computation complexity and the sample efficiency. We train for 50,000 iterations with the same divergence constraint as used in Deep GA. (Illustrative sketches of the quoted policy architecture and of a Deep GA generation step appear after the table.) |
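
The Pseudocode row only names the paper's routines (MCTSPO, ROLLOUT, GETCA) without reproducing their bodies. The Python sketch below shows one way such a tree search over policy-parameter perturbations could be organized. The `Node` class, the Gaussian-perturbation stand-in for `GETCA`, the tanh-linear rollout policy, and the classic gym-style environment interface are illustrative assumptions, not the authors' implementation; only the exploration constant, the progressive-widening rule, and `nca = 4` are taken from the quoted experiment setup.

```python
# Hypothetical sketch of an MCTS-style search over policy perturbations,
# loosely following the routine names quoted from the paper's pseudocode
# (MCTSPO / ROLLOUT / GETCA).  All helpers and data structures here are
# assumptions made for illustration only.
import math
import random

import numpy as np

EXPLORATION_C = 2.0      # exploration constant quoted for the classic control tasks
PW_ALPHA = PW_K = 0.5    # one of the quoted progressive-widening settings
N_CANDIDATE_ACTIONS = 4  # nca from the quoted setup


class Node:
    """Search-tree node holding one policy-parameter matrix."""

    def __init__(self, params, parent=None):
        self.params = params     # policy parameters reached at this node
        self.parent = parent
        self.children = []       # expanded children
        self.visits = 0
        self.value = 0.0         # running mean of rollout returns


def policy_action(params, obs):
    """Hypothetical stand-in policy: a deterministic tanh of a linear map."""
    return np.tanh(params @ obs)


def get_candidate_actions(node, n=N_CANDIDATE_ACTIONS, sigma=0.1):
    """Stand-in for GETCA: propose a few Gaussian parameter perturbations
    (the paper uses safe mutation; plain Gaussian noise is used here)."""
    return [node.params + sigma * np.random.randn(*node.params.shape)
            for _ in range(n)]


def rollout(env, params, horizon=1000):
    """Stand-in for ROLLOUT: return the episode return of the perturbed policy
    in a classic gym-style environment (reset/step returning a 4-tuple)."""
    obs = env.reset()
    total = 0.0
    for _ in range(horizon):
        obs, reward, done, _ = env.step(policy_action(params, obs))
        total += reward
        if done:
            break
    return total


def select_child(node):
    """UCB1-style selection among already-expanded children
    (return normalization is omitted for brevity)."""
    log_n = math.log(node.visits + 1)
    return max(node.children,
               key=lambda c: c.value + EXPLORATION_C * math.sqrt(log_n / (c.visits + 1e-8)))


def mctspo(env, init_params, iterations=1000):
    """Top-level search loop: select with progressive widening, expand, evaluate, back up."""
    root = Node(init_params)
    for _ in range(iterations):
        node = root
        # Descend while the node already has "enough" children (k * N^alpha rule).
        while node.children and len(node.children) >= PW_K * (node.visits ** PW_ALPHA):
            node = select_child(node)
        # Expansion: add one child drawn from the candidate perturbations.
        child = Node(random.choice(get_candidate_actions(node)), parent=node)
        node.children.append(child)
        # Evaluation and backup of the rollout return along the path to the root.
        ret = rollout(env, child.params)
        while child is not None:
            child.visits += 1
            child.value += (ret - child.value) / child.visits
            child = child.parent
    return max(root.children, key=lambda c: c.value).params
```

The expand-while-children-are-few condition is the standard progressive-widening heuristic; the paper's safe-mutation-based candidate generation and full backup scheme may differ from this simplified version.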
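
The Experiment Setup row quotes a Gaussian multilayer perceptron policy with hidden sizes 128, 64, and 32 and tanh activations, taken from RLLab. Below is a minimal PyTorch sketch of such a policy with a learned state-independent log-standard-deviation; it illustrates the quoted architecture, not the RLLab implementation used in the paper, and the observation/action dimensions in the usage example are illustrative.

```python
# Minimal sketch (not the authors' RLLab code) of a Gaussian MLP policy with
# the quoted hidden sizes 128-64-32 and tanh activations.  The learned,
# state-independent log-std parameterization is an assumption.
import torch
import torch.nn as nn


class GaussianMLPPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_sizes=(128, 64, 32)):
        super().__init__()
        layers, last = [], obs_dim
        for h in hidden_sizes:
            layers += [nn.Linear(last, h), nn.Tanh()]
            last = h
        layers.append(nn.Linear(last, act_dim))
        self.mean_net = nn.Sequential(*layers)          # outputs the action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())


# Usage example with illustrative dimensions (e.g., a Bipedal Walker-sized task).
policy = GaussianMLPPolicy(obs_dim=24, act_dim=4)
dist = policy(torch.zeros(1, 24))
action = dist.sample()
```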
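
The same row also describes the Deep GA baseline: truncation selection with size 20 and elitism keeping the top three individuals unmutated. The sketch below shows one generation step under those quoted settings; `evaluate` is a hypothetical fitness function (e.g., average episode return), and the Gaussian parameter noise is a stand-in for the paper's safe-mutation operator with its divergence constraint.

```python
# Illustrative Deep GA generation step under the quoted settings
# (truncation size 20, three elites).  Not the authors' implementation.
import numpy as np


def next_generation(population, evaluate, pop_size=100, truncation=20,
                    elites=3, sigma=0.1, rng=None):
    """population: list of flat parameter arrays; returns the next population."""
    rng = np.random.default_rng() if rng is None else rng
    ranked = sorted(population, key=evaluate, reverse=True)   # rank by fitness
    parents = ranked[:truncation]                             # truncation selection
    new_pop = [p.copy() for p in ranked[:elites]]             # elitism: keep top 3 unmutated
    # Fill the rest of the population with mutated copies of random parents
    # (Gaussian noise here stands in for the paper's safe mutation).
    while len(new_pop) < pop_size:
        parent = parents[rng.integers(len(parents))]
        new_pop.append(parent + sigma * rng.standard_normal(parent.shape))
    return new_pop
```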