Convex Regularization in Monte-Carlo Tree Search
Authors: Tuan Q Dam, Carlo D’Eramo, Jan Peters, Joni Pajarinen
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the consequence of our theoretical results on a toy problem. Finally, we show how our framework can easily be incorporated in AlphaGo and we empirically show the superiority of convex regularization, w.r.t. representative baselines, on well-known RL problems across several Atari games. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Technische Universität Darmstadt, Germany; 2Department of Electrical Engineering and Automation, Aalto University, Finland. |
| Pseudocode | No | The paper describes algorithms (e.g., UCT, E3W) and mathematical formulations, but it does not include a clearly labeled pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing its source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Atari. Atari 2600 (Bellemare et al., 2013) is a popular benchmark for testing deep RL methodologies |
| Dataset Splits | No | The paper mentions using a pretrained Deep Q-Network for initialization and conducting experimental runs with MCTS simulations, but it does not provide specific train/validation/test dataset splits for its own model training or evaluation setup. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using a 'Deep Q-Network' and incorporating the framework into 'AlphaGo', but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, specific libraries) that would be needed for reproducibility. |
| Experiment Setup | Yes | For a fair comparison, we use fixed τ = 0.1 and ϵ = 0.1 across all algorithms. ... Each experimental run consists of 512 MCTS simulations. The temperature τ is optimized for each algorithm and game via grid-search between 0.01 and 1. The discount factor is γ = 0.99, and for PUCT the exploration constant is c = 0.1. |
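Since the authors do not release code, the following Python snippet is only an illustrative sketch that collects the hyperparameters quoted in the Experiment Setup row above; the dictionary name and keys are hypothetical, not from the paper.

```python
# Hypothetical configuration sketch for the reported MCTS/Atari experiments.
# All numeric values come from the Experiment Setup quote above; the structure
# and key names are assumptions for illustration only.
mcts_config = {
    "temperature_tau": 0.1,            # fixed tau used in the fair-comparison runs
    "epsilon": 0.1,                    # fixed epsilon shared across all algorithms
    "num_simulations": 512,            # MCTS simulations per experimental run
    "discount_gamma": 0.99,            # discount factor
    "puct_exploration_c": 0.1,         # exploration constant for PUCT
    "tau_search_range": (0.01, 1.0),   # per-game grid-search range for tau (exact grid not stated)
}

if __name__ == "__main__":
    for name, value in mcts_config.items():
        print(f"{name}: {value}")
```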