Online and Offline Reinforcement Learning by Planning with a Learned Model
Authors: Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting. |
| Researcher Affiliation | Collaboration | Julian Schrittwieser (DeepMind) swj@google.com; Thomas Hubert (DeepMind) tkhubert@google.com; Amol Mandhane (DeepMind) mandhane@google.com; Mohammadamin Barekatain (DeepMind) barekatain@google.com; Ioannis Antonoglou (DeepMind, University College London) ioannisa@google.com; David Silver (DeepMind, University College London) davidsilver@google.com |
| Pseudocode | Yes | Algorithm 1 (the Reanalyse algorithm). MuZero Unplugged instantiates representation, predict, dynamics with the MuZero network architecture; plan with MCTS; loss with the MuZero loss in eqn (1); and optimise with Adam. for step 0...N do: t ~ random(1:T); s^0_t = representation(h_{1:t}, θ); for i = 0...k do: π^i_t, ν^i_t = plan(representation(h_{1:t+i}, θ), θ); p^i_t, v^i_t = predict(s^i_t, θ); r^{i+1}_t, s^{i+1}_t = dynamics(s^i_t, a_{t+i}, θ); end for; l = loss(h_{t:t+k}, {r, p, v, u, π, ν}^{0:k}_t, θ); θ = optimise(l, θ); end for. (A Python sketch of this loop is given after the table.) |
| Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available, nor does it provide a specific link to a code repository for this work. |
| Open Datasets | Yes | We used the RL Unplugged (Gulcehre et al., 2020) benchmark dataset for all offline RL experiments in this paper. To demonstrate the generality of the approach, we report results for both discrete and continuous action spaces as well as state and pixel based data, specifically: DM Control Suite, 9 different tasks, number of frames varies by task (Table 3). Continuous action space with 1 to 21 dimensions, state observations. Atari, 46 games with 200M frames each. Discrete action space, pixel observations, stochasticity through sticky actions (Machado et al., 2017). |
| Dataset Splits | No | The paper mentions using the RL Unplugged benchmark dataset and specific data budgets such as '1% (2 million frames) or 10% (20 million frames) of Atari data' and the 'standard 200 million frame setting'. However, it does not specify explicit train/validation/test splits beyond referring to these benchmarks. |
| Hardware Specification | Yes | Google, 2018. Cloud TPU. https://cloud.google.com/tpu/. Accessed: 2019. |
| Software Dependencies | No | The paper references software like JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020) but does not provide specific version numbers for these or other key software dependencies used in the experiments. |
| Experiment Setup | No | We performed no tuning of hyperparameters for these experiments, instead using the same hyperparameter values as for the online RL case (Schrittwieser et al., 2020; Hubert et al., 2021). This defers the specific hyperparameter values to external sources rather than providing them within the paper's main text. |
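
The pseudocode quoted above can be read as a generic training loop. The following is a minimal Python sketch of that loop, not DeepMind's implementation: the six callables (representation, plan, predict, dynamics, loss, optimise) only mirror the names in Algorithm 1, and the Trajectory container, argument shapes, and sampling details are assumptions. In MuZero Unplugged these callables would be the MuZero network heads, MCTS, the MuZero loss of eqn (1), and an Adam update.

```python
# Minimal sketch of the Reanalyse loop from Algorithm 1 (assumed interfaces).
import random
from typing import Any, Callable, NamedTuple, Sequence

class Trajectory(NamedTuple):
    observations: Sequence[Any]  # h_1 ... h_T
    actions: Sequence[Any]       # a_1 ... a_T
    rewards: Sequence[float]     # observed rewards u_1 ... u_T

def reanalyse(
    trajectory: Trajectory,
    theta: Any,
    representation: Callable, plan: Callable, predict: Callable,
    dynamics: Callable, loss: Callable, optimise: Callable,
    num_steps: int, k: int,
) -> Any:
    """Repeatedly re-plans on stored data and updates the parameters theta."""
    T = len(trajectory.observations)
    for _ in range(num_steps):
        t = random.randint(1, T - k - 1)                         # sample a start index t
        s = representation(trajectory.observations[:t], theta)   # root state s^0_t
        targets = []
        for i in range(k + 1):
            # Fresh search-based policy/value targets pi^i_t, nu^i_t at time t+i.
            pi, nu = plan(representation(trajectory.observations[:t + i], theta), theta)
            p, v = predict(s, theta)                              # network predictions p^i_t, v^i_t
            r, s = dynamics(s, trajectory.actions[t + i], theta)  # unroll one step: r^{i+1}_t, s^{i+1}_t
            targets.append((r, p, v, trajectory.rewards[t + i], pi, nu))
        l = loss(trajectory, targets, theta)                      # loss over the k-step unroll
        theta = optimise(l, theta)                                # gradient step (e.g. Adam)
    return theta
```

Writing the loop over injected callables keeps the sketch runnable as a module while leaving the network, search, and optimiser unspecified, which is exactly what the paper's pseudocode does.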