Online and Offline Reinforcement Learning by Planning with a Learned Model

Authors: Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MuZero Unplugged sets new state-of-the-art results in the RL Unplugged offline RL benchmark as well as in the online RL benchmark of Atari in the standard 200 million frame setting.
Researcher Affiliation | Collaboration | Julian Schrittwieser (DeepMind) swj@google.com; Thomas Hubert (DeepMind) tkhubert@google.com; Amol Mandhane (DeepMind) mandhane@google.com; Mohammadamin Barekatain (DeepMind) barekatain@google.com; Ioannis Antonoglou (DeepMind, University College London) ioannisa@google.com; David Silver (DeepMind, University College London) davidsilver@google.com
Pseudocode | Yes | Algorithm 1, the Reanalyse algorithm. MuZero Unplugged instantiates representation, predict, dynamics with the MuZero network architecture; plan with MCTS; loss with the MuZero loss in eqn (1); and optimise with Adam (a hedged Python sketch of this loop appears after the table):
    for step = 0 ... N do
        t ← random(1 : T)
        s^0_t = representation(h_{1:t}, θ)
        for i = 0 ... k do
            π^i_t, ν^i_t = plan(representation(h_{1:t+i}, θ), θ)
            p^i_t, v^i_t = predict(s^i_t, θ)
            r^{i+1}_t, s^{i+1}_t = dynamics(s^i_t, a_{t+i}, θ)
        end for
        l = loss(h_{t:t+k}, {r, p, v, u, π, ν}^{0:k}_t, θ)
        θ = optimise(l, θ)
    end for
Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available, nor does it provide a specific link to a code repository for this work.
Open Datasets | Yes | We used the RL Unplugged (Gulcehre et al., 2020) benchmark dataset for all offline RL experiments in this paper. To demonstrate the generality of the approach, we report results for both discrete and continuous action spaces as well as state and pixel based data, specifically: DM Control Suite, 9 different tasks, number of frames varies by task (Table 3), continuous action space with 1 to 21 dimensions, state observations; Atari, 46 games with 200M frames each, discrete action space, pixel observations, stochasticity through sticky actions (Machado et al., 2017). (A hedged data-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions using the RL Unplugged benchmark dataset and specific data budgets such as '1% (2 million frames) or 10% (20 million frames) of Atari data' and the 'standard 200 million frame setting', but it does not explicitly describe train/validation/test splits beyond referring to these benchmarks.
Hardware Specification | Yes | Google, 2018. Cloud TPU. https://cloud.google.com/tpu/. Accessed: 2019.
Software Dependencies | No | The paper references software like JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020) but does not provide specific version numbers for these or other key software dependencies used in the experiments.
Experiment Setup | No | We performed no tuning of hyperparameters for these experiments, instead using the same hyperparameter values as for the online RL case (Schrittwieser et al., 2020; Hubert et al., 2021). This defers the specific hyperparameter values to external sources rather than providing them within the paper's main text.
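
To make Algorithm 1 concrete, the Reanalyse loop can be sketched in Python. This is a minimal sketch, not the authors' implementation: representation, predict, dynamics, plan, loss, and optimise are placeholder callables standing in for the MuZero networks, the MCTS planner, the MuZero loss of eqn (1), and the Adam update, and the trajectory storage format, indexing, and absence of batching are simplifying assumptions.

# Minimal sketch of the Reanalyse loop in Algorithm 1; not the authors' code.
# representation, predict, dynamics, plan, loss and optimise are placeholders
# for the MuZero network functions, the MCTS planner, the MuZero loss
# (eqn (1)) and the Adam update; storage and indexing are simplified.

import random

def reanalyse(trajectory, theta, num_steps, k,
              representation, predict, dynamics, plan, loss, optimise):
    """Repeatedly re-plan on stored data and update the parameters theta.

    trajectory is assumed to be a dict with observation history "h",
    actions "a" and observed rewards "u", each a list indexed by time step.
    """
    T = len(trajectory["a"])
    for _ in range(num_steps):
        # Sample a start position t, leaving room for a k-step unroll.
        t = random.randint(0, T - k - 1)

        # Encode the observation history up to t into the initial model state.
        s = representation(trajectory["h"][: t + 1], theta)

        targets = []
        for i in range(k + 1):
            # Fresh policy/value targets (pi, nu) from MCTS at step t + i.
            pi, nu = plan(representation(trajectory["h"][: t + i + 1], theta), theta)

            # Model predictions at unroll step i.
            p, v = predict(s, theta)

            # Unroll the learned dynamics with the stored action a_{t+i}.
            r, s = dynamics(s, trajectory["a"][t + i], theta)

            # Collect everything the loss needs; the exact alignment of the
            # observed reward u with the predicted reward r is elided here.
            targets.append(dict(r=r, p=p, v=v,
                                u=trajectory["u"][t + i], pi=pi, nu=nu))

        # MuZero loss over the k-step unroll, then one optimiser update.
        l = loss(trajectory["h"][t : t + k + 1], targets, theta)
        theta = optimise(l, theta)
    return theta

The point of Reanalyse is that plan is re-run on stored trajectories with the latest parameters, so the policy and value targets keep improving as training progresses even though no new environment frames are collected.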
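
The RL Unplugged data cited in the Open Datasets row is distributed through public tooling; one common access path is TensorFlow Datasets. The snippet below is a hedged sketch under that assumption: the dataset name rlu_atari, the per-game config Breakout_run_1, and the RLDS-style "steps" field are guesses at the catalog naming rather than details taken from the paper, and should be checked against the TFDS documentation.

# Hedged sketch: inspecting RL Unplugged Atari data via TensorFlow Datasets.
# The name "rlu_atari/Breakout_run_1" and the nested "steps" field are
# assumptions about the TFDS catalog (RLDS episode format), not taken from
# the paper; verify them against the TFDS documentation before use.

import tensorflow_datasets as tfds

def peek_rlu_atari(name="rlu_atari/Breakout_run_1", num_episodes=1):
    # Each element of the loaded dataset is one logged episode of experience.
    episodes = tfds.load(name, split="train")
    for episode in episodes.take(num_episodes):
        # In the RLDS layout, transitions live in a nested "steps" dataset.
        for step in episode["steps"].take(3):
            print(sorted(step.keys()))  # list the available per-step fields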