Model Based Reinforcement Learning for Atari
Authors: Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments evaluate SimPLe on a range of Atari games in low data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude. |
| Researcher Affiliation | Collaboration | 1Google Brain, 2deepsense.ai, 3Institute of Mathematics of the Polish Academy of Sciences, 4Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 5University of Illinois at Urbana Champaign, 6Stanford University |
| Pseudocode | Yes | Algorithm 1: Pseudocode for SimPLe (a minimal Python sketch of this loop is included after the table). |
| Open Source Code | Yes | The source code is available as part of the Tensor2Tensor library and it includes instructions on how to run the experiments3. 3https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl |
| Open Datasets | Yes | We evaluate Sim PLe on a suite of Atari games from Atari Learning Environment (ALE) benchmark Bellemare et al. (2015); Machado et al. (2018). |
| Dataset Splits | No | The paper uses terms like 'model training' and 'validation' in the context of the internal workings of their world model (e.g., for stochastic models). However, it does not explicitly provide percentages or counts for overall dataset splits for training, validation, and testing of their full experimental evaluation, as is common for supervised learning tasks. The evaluation is based on interactions within the RL environment. |
| Hardware Specification | Yes | The whole model has around 74M parameters and the inference/backpropagation time is approx. 0.5s/0.7s respectively, where inference is on batch size 16 and backpropagation on batch size 2, running on NVIDIA Tesla P100. ...extensively used the Prometheus supercomputer, located in the Academic Computer Center Cyfronet in the AGH University of Science and Technology in Kraków, Poland. |
| Software Dependencies | No | The paper mentions software like 'Tensor2Tensor library', 'Dopamine package', and 'OpenAI baselines' but does not specify their version numbers. |
| Experiment Setup | Yes | In step 6 we use the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) with γ = 0.95. ... The main loop in Algorithm 1 is iterated 15 times ... The world model is trained for 45K steps in the first iteration and for 15K steps in each of the following ones. ... In every PPO epoch we used 16 parallel agents collecting 25, 50 or 100 steps from the simulated environment env ... The number of PPO epochs is z · 1000, where z equals 1 in all passes except the last one (where z = 3) and passes 8 and 12 (where z = 2). ... a frame skip equal to 4, that is every action is repeated 4 times. The frames are down-scaled by a factor of 2. ... In most cases, we use a stack of four convolutional layers with 64 filters followed by three dense layers ... We used dropout equal to 0.2 and layer normalization. ... We set C = 10 for L2 loss on pixel values and C = 0.03 for the softmax loss. (These values are gathered in the illustrative configuration sketch after this table.) |
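The "Pseudocode" row refers to Algorithm 1, which alternates real-environment data collection, world-model training, and policy training inside the learned model. Below is a minimal Python sketch of that loop under our own assumptions: the 15-iteration and 45K/15K-step schedule comes from the quotes above, but the helper callables `collect_rollouts`, `train_world_model`, and `train_ppo`, and their signatures, are illustrative and not the paper's Tensor2Tensor API.

```python
# Minimal sketch of the SimPLe main loop (Algorithm 1). The helper callables
# are hypothetical placeholders supplied by the caller; only the schedule
# constants are taken from the paper.

def simple_main_loop(real_env, policy, world_model, dataset,
                     collect_rollouts, train_world_model, train_ppo,
                     num_iterations=15):
    """Alternate real-env data collection, world-model training,
    and PPO policy training inside the learned model."""
    for iteration in range(num_iterations):
        # 1. Interact with the real environment using the current policy
        #    and append the collected frames/actions/rewards to the dataset.
        dataset.extend(collect_rollouts(real_env, policy))

        # 2. Update the world model on all data gathered so far
        #    (45K steps in the first iteration, 15K afterwards).
        model_steps = 45_000 if iteration == 0 else 15_000
        train_world_model(world_model, dataset, steps=model_steps)

        # 3. Train the policy with PPO purely inside the learned model.
        train_ppo(policy, world_model)

    return policy
```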
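The "Experiment Setup" quotes scatter the key hyperparameters across several sentences; the dict below collects them in one place as an illustrative configuration sketch. The values are transcribed from the row above, but the key names are ours and do not correspond to actual Tensor2Tensor hyperparameter names.

```python
# Hyperparameters quoted in the "Experiment Setup" row, gathered into one
# illustrative dict. Key names are assumptions, not the library's own.

SIMPLE_HPARAMS = {
    # PPO policy training (step 6 of Algorithm 1)
    "ppo_gamma": 0.95,                    # discount factor
    "ppo_parallel_agents": 16,            # agents collecting simulated rollouts
    "ppo_rollout_steps": (25, 50, 100),   # simulated-rollout lengths used
    # PPO epochs per pass is z * 1000, with z = 1 everywhere except
    # z = 2 in passes 8 and 12 and z = 3 in the final pass.
    "ppo_epochs_multiplier": 1000,

    # Outer loop / world-model training schedule
    "main_loop_iterations": 15,
    "world_model_steps_first_iter": 45_000,
    "world_model_steps_later_iters": 15_000,

    # Environment preprocessing
    "frame_skip": 4,                      # each action repeated 4 times
    "frame_downscale_factor": 2,

    # World-model architecture and regularization
    "conv_layers": 4,
    "conv_filters": 64,
    "dense_layers": 3,
    "dropout": 0.2,
    "layer_norm": True,

    # Loss weighting on reconstructed frames
    "pixel_l2_loss_coef": 10.0,           # C for the L2 loss on pixel values
    "softmax_loss_coef": 0.03,            # C for the softmax loss
}
```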