Planning in Stochastic Environments with a Learned Model
Authors: Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K Hubert, David Silver
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we extend this approach to learn and plan with stochastic models. Specifically, we introduce a new algorithm, Stochastic MuZero, that learns a stochastic model incorporating afterstates, and uses this model to perform a stochastic tree search. Stochastic MuZero matched or exceeded the state of the art in a set of canonical single and multi-agent environments, including 2048 and backgammon, while maintaining the superhuman performance of standard MuZero in the game of Go. (An illustrative sketch of the afterstate decomposition follows the table.) |
| Researcher Affiliation | Collaboration | Ioannis Antonoglou (1,2), Julian Schrittwieser (1), Sherjil Ozair (1), Thomas Hubert (1), David Silver (1,2); (1) DeepMind, London, UK; (2) University College London |
| Pseudocode | Yes | The paper provides pseudocode in Appendix I ("PSEUDOCODE"). |
| Open Source Code | No | We did not release the full code as it relies on a lot of proprietary internal infrastructure, limiting its usefulness. |
| Open Datasets | Yes | We applied our algorithm to a variety of challenging stochastic and deterministic environments. First, we evaluated our approach in the classic game of 2048, a stochastic single player game. Subsequently, we considered a two player zero-sum stochastic game, Backgammon, which belongs to the same class of board games as Go, chess or Shogi, where MuZero excels... Finally, we evaluated our method in the deterministic game of Go... |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., percentages or sample counts). Reinforcement learning experiments typically involve learning through interaction rather than fixed dataset splits. |
| Hardware Specification | Yes | All experiments were run using second generation Google Cloud TPUs (Google, 2018). For Backgammon, we used 1 TPU for training and 16 TPUs for acting, for approximately 27 hours, equivalent to 10 days on a single V100 GPU. In 2048 we used 1 TPU for training and 4 TPUs for acting, for 80 hours per experiment, equivalent to roughly 8 days on a V100. |
| Software Dependencies | No | The paper mentions 'JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020) libraries' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We trained the model using an Adam (Kingma & Ba, 2015) optimizer and a learning rate of 0.0003 for 20M steps with a batch size of 1024. We used a prioritized replay buffer... and we set α=1... we set β=1. We used a budget of 100 simulations for each MCTS search. We used hyperparameters α = 0.25 and fraction = 0.1 for the injected noise. Furthermore, the agent selected actions by sampling from the visit count distribution at the root node at the end of each search. We used a temperature scheduler with values [1.0, 0.5, 0.1] for the first [1e5, 2e5, 3e5] training steps respectively, and greedy selection thereafter. (The action-selection rule is sketched after the table.) |
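The quoted research-type result summarizes the core of the method: the learned model factors each transition into a deterministic afterstate step for the agent's action, followed by a chance step resolved by an outcome drawn from a learned distribution over a finite codebook of chance codes. The sketch below illustrates only that decomposition; the tiny random linear maps, dimension constants, and function names are illustrative placeholders, not the authors' architecture or released code.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS, NUM_CODES = 8, 4, 32  # illustrative sizes, not the paper's

# Random parameters standing in for the trained afterstate-dynamics,
# chance-prediction, and dynamics networks.
W_after = rng.normal(size=(STATE_DIM + NUM_ACTIONS, STATE_DIM))
W_sigma = rng.normal(size=(STATE_DIM, NUM_CODES))
W_dyn = rng.normal(size=(STATE_DIM + NUM_CODES, STATE_DIM))

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def afterstate_dynamics(state, action):
    """Deterministic step: (state, action) -> afterstate."""
    return np.tanh(np.concatenate([state, one_hot(action, NUM_ACTIONS)]) @ W_after)

def chance_distribution(afterstate):
    """Prior sigma over the finite codebook of chance outcomes."""
    logits = afterstate @ W_sigma
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def dynamics(afterstate, chance_code):
    """Chance step: (afterstate, sampled code) -> next latent state."""
    return np.tanh(np.concatenate([afterstate, one_hot(chance_code, NUM_CODES)]) @ W_dyn)

# One simulated transition during tree search: apply the action deterministically,
# then resolve the environment's stochasticity by sampling a chance code.
state = rng.normal(size=STATE_DIM)
afterstate = afterstate_dynamics(state, action=2)
sigma = chance_distribution(afterstate)
code = rng.choice(NUM_CODES, p=sigma)
next_state = dynamics(afterstate, code)
```

During search, chance nodes are expanded by sampling codes from the learned distribution rather than enumerating environment outcomes, which is what keeps the tree search tractable in stochastic games such as 2048 and backgammon.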
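The experiment-setup row also specifies how actions were selected during acting: sample from the root visit-count distribution under a temperature schedule of [1.0, 0.5, 0.1] for the first [1e5, 2e5, 3e5] training steps, then act greedily. Below is a minimal sketch of that rule; the function names and the use of NumPy are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def temperature(train_step):
    """Temperature schedule quoted in the setup: [1.0, 0.5, 0.1] for the first
    [1e5, 2e5, 3e5] training steps, then greedy (encoded here as 0.0)."""
    for limit, temp in [(1e5, 1.0), (2e5, 0.5), (3e5, 0.1)]:
        if train_step < limit:
            return temp
    return 0.0

def select_action(visit_counts, train_step, rng):
    """Sample an action from the MCTS root visit counts under the schedule."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    temp = temperature(train_step)
    if temp == 0.0:
        return int(np.argmax(counts))  # greedy selection after 3e5 steps
    probs = counts ** (1.0 / temp)
    probs /= probs.sum()
    return int(rng.choice(len(counts), p=probs))

# Example: a 100-simulation search spread over 4 actions, early in training.
rng = np.random.default_rng(0)
print(select_action([10, 55, 25, 10], train_step=50_000, rng=rng))
```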