Online and Offline Reinforcement Learning by Planning with a Learned Model

Authors: Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.
Researcher Affiliation | Collaboration | Julian Schrittwieser (DeepMind) swj@google.com; Thomas Hubert (DeepMind) tkhubert@google.com; Amol Mandhane (DeepMind) mandhane@google.com; Mohammadamin Barekatain (DeepMind) barekatain@google.com; Ioannis Antonoglou (DeepMind, University College London) ioannisa@google.com; David Silver (DeepMind, University College London) davidsilver@google.com
Pseudocode | Yes | Algorithm 1, the Reanalyse algorithm. MuZero Unplugged instantiates representation, predict, dynamics with the MuZero network architecture; plan with MCTS; loss with the MuZero loss in eqn (1); and optimise with Adam (a hedged Python sketch of this loop appears after the table):
    for step = 0 ... N do
        t ← random(1 : T)
        s^0_t = representation(h_{1:t}, θ)
        for i = 0 ... k do
            π^i_t, ν^i_t = plan(representation(h_{1:t+i}, θ), θ)
            p^i_t, v^i_t = predict(s^i_t, θ)
            r^{i+1}_t, s^{i+1}_t = dynamics(s^i_t, a_{t+i}, θ)
        end for
        l = loss(h_{t:t+k}, {r, p, v, u, π, ν}^{0:k}_t, θ)
        θ = optimise(l, θ)
    end for
Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available, nor does it provide a specific link to a code repository for this work.
Open Datasets | Yes | We used the RL Unplugged (Gulcehre et al., 2020) benchmark dataset for all offline RL experiments in this paper. To demonstrate the generality of the approach, we report results for both discrete and continuous action spaces as well as state and pixel based data, specifically: DM Control Suite, 9 different tasks, number of frames varies by task (Table 3), continuous action space with 1 to 21 dimensions, state observations; Atari, 46 games with 200M frames each, discrete action space, pixel observations, stochasticity through sticky actions (Machado et al., 2017). (A hedged data-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions using the RL Unplugged benchmark dataset and specific data budgets such as '1% (2 million frames) or 10% (20 million frames) of Atari data' and the 'standard 200 million frame setting', but it does not explicitly describe train/validation/test splits beyond referring to these benchmarks.
Hardware Specification | Yes | Google, 2018. Cloud TPU. https://cloud.google.com/tpu/. Accessed: 2019.
Software Dependencies | No | The paper references software like JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020) but does not provide specific version numbers for these or other key software dependencies used in the experiments.
Experiment Setup | No | We performed no tuning of hyperparameters for these experiments, instead using the same hyperparameter values as for the online RL case (Schrittwieser et al., 2020; Hubert et al., 2021). This defers the specific hyperparameter values to external sources rather than providing them within the paper's main text.
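
To make Algorithm 1 concrete, the Reanalyse loop can be sketched in Python. This is a minimal sketch, not the authors' implementation: representation, predict, dynamics, plan, loss, and optimise are placeholder callables standing in for the MuZero networks, the MCTS planner, the MuZero loss of eqn (1), and the Adam update, and the trajectory storage format, indexing, and absence of batching are simplifying assumptions.

# Minimal sketch of the Reanalyse loop in Algorithm 1; not the authors' code.
# representation, predict, dynamics, plan, loss and optimise are placeholders
# for the MuZero network functions, the MCTS planner, the MuZero loss
# (eqn (1)) and the Adam update; storage and indexing are simplified.

import random

def reanalyse(trajectory, theta, num_steps, k,
              representation, predict, dynamics, plan, loss, optimise):
    """Repeatedly re-plan on stored data and update the parameters theta.

    trajectory is assumed to be a dict with observation history "h",
    actions "a" and observed rewards "u", each a list indexed by time step.
    """
    T = len(trajectory["a"])
    for _ in range(num_steps):
        # Sample a start position t, leaving room for a k-step unroll.
        t = random.randint(0, T - k - 1)

        # Encode the observation history up to t into the initial model state.
        s = representation(trajectory["h"][: t + 1], theta)

        targets = []
        for i in range(k + 1):
            # Fresh policy/value targets (pi, nu) from MCTS at step t + i.
            pi, nu = plan(representation(trajectory["h"][: t + i + 1], theta), theta)

            # Model predictions at unroll step i.
            p, v = predict(s, theta)

            # Unroll the learned dynamics with the stored action a_{t+i}.
            r, s = dynamics(s, trajectory["a"][t + i], theta)

            # Collect everything the loss needs; the exact alignment of the
            # observed reward u with the predicted reward r is elided here.
            targets.append(dict(r=r, p=p, v=v,
                                u=trajectory["u"][t + i], pi=pi, nu=nu))

        # MuZero loss over the k-step unroll, then one optimiser update.
        l = loss(trajectory["h"][t : t + k + 1], targets, theta)
        theta = optimise(l, theta)
    return theta

The point of Reanalyse is that plan is re-run on stored trajectories with the latest parameters, so the policy and value targets keep improving as training progresses even though no new environment frames are collected.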
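
The RL Unplugged data cited in the Open Datasets row is distributed through public tooling; one common access path is TensorFlow Datasets. The snippet below is a hedged sketch under that assumption: the dataset name rlu_atari, the per-game config Breakout_run_1, and the RLDS-style "steps" field are guesses at the catalog naming rather than details taken from the paper, and should be checked against the TFDS documentation.

# Hedged sketch: inspecting RL Unplugged Atari data via TensorFlow Datasets.
# The name "rlu_atari/Breakout_run_1" and the nested "steps" field are
# assumptions about the TFDS catalog (RLDS episode format), not taken from
# the paper; verify them against the TFDS documentation before use.

import tensorflow_datasets as tfds

def peek_rlu_atari(name="rlu_atari/Breakout_run_1", num_episodes=1):
    # Each element of the loaded dataset is one logged episode of experience.
    episodes = tfds.load(name, split="train")
    for episode in episodes.take(num_episodes):
        # In the RLDS layout, transitions live in a nested "steps" dataset.
        for step in episode["steps"].take(3):
            print(sorted(step.keys()))  # list the available per-step fields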