Model Based Reinforcement Learning for Atari
Authors: Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments evaluate SimPLe on a range of Atari games in low data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude. |
| Researcher Affiliation | Collaboration | 1Google Brain, 2deepsense.ai, 3Institute of Mathematics of the Polish Academy of Sciences, 4Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, 5University of Illinois at Urbana Champaign, 6Stanford University |
| Pseudocode | Yes | Algorithm 1: Pseudocode for SimPLe (a minimal Python sketch of this loop is included after the table). |
| Open Source Code | Yes | The source code is available as part of the Tensor2Tensor library and it includes instructions on how to run the experiments3. 3https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl |
| Open Datasets | Yes | We evaluate Sim PLe on a suite of Atari games from Atari Learning Environment (ALE) benchmark Bellemare et al. (2015); Machado et al. (2018). |
| Dataset Splits | No | The paper uses terms like 'model training' and 'validation' in the context of the internal workings of their world model (e.g., for stochastic models). However, it does not explicitly provide percentages or counts for overall dataset splits for training, validation, and testing of their full experimental evaluation, as is common for supervised learning tasks. The evaluation is based on interactions within the RL environment. |
| Hardware Specification | Yes | The whole model has around 74M parameters and the inference/backpropagation time is approx. 0.5s/0.7s respectively, where inference is on batch size 16 and backpropagation on batch size 2, running on NVIDIA Tesla P100. ...extensively used the Prometheus supercomputer, located in the Academic Computer Center Cyfronet in the AGH University of Science and Technology in Kraków, Poland. |
| Software Dependencies | No | The paper mentions software like 'Tensor2Tensor library', 'Dopamine package', and 'OpenAI baselines' but does not specify their version numbers. |
| Experiment Setup | Yes | In step 6 we use the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) with γ = 0.95. ... The main loop in Algorithm 1 is iterated 15 times ... The world model is trained for 45K steps in the first iteration and for 15K steps in each of the following ones. ... In every PPO epoch we used 16 parallel agents collecting 25, 50 or 100 steps from the simulated environment env ... The number of PPO epochs is z · 1000, where z equals 1 in all passes except the last one (where z = 3) and passes 8 and 12 (where z = 2). ... a frame skip equal to 4, that is every action is repeated 4 times. The frames are down-scaled by a factor of 2. ... In most cases, we use a stack of four convolutional layers with 64 filters followed by three dense layers ... We used dropout equal to 0.2 and layer normalization. ... We set C = 10 for L2 loss on pixel values and C = 0.03 for the softmax loss. (These values are gathered in the illustrative configuration sketch after this table.) |
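The "Pseudocode" row refers to Algorithm 1, which alternates real-environment data collection, world-model training, and policy training inside the learned model. Below is a minimal Python sketch of that loop under our own assumptions: the 15-iteration and 45K/15K-step schedule comes from the quotes above, but the helper callables `collect_rollouts`, `train_world_model`, and `train_ppo`, and their signatures, are illustrative and not the paper's Tensor2Tensor API.

```python
# Minimal sketch of the SimPLe main loop (Algorithm 1). The helper callables
# are hypothetical placeholders supplied by the caller; only the schedule
# constants are taken from the paper.

def simple_main_loop(real_env, policy, world_model, dataset,
                     collect_rollouts, train_world_model, train_ppo,
                     num_iterations=15):
    """Alternate real-env data collection, world-model training,
    and PPO policy training inside the learned model."""
    for iteration in range(num_iterations):
        # 1. Interact with the real environment using the current policy
        #    and append the collected frames/actions/rewards to the dataset.
        dataset.extend(collect_rollouts(real_env, policy))

        # 2. Update the world model on all data gathered so far
        #    (45K steps in the first iteration, 15K afterwards).
        model_steps = 45_000 if iteration == 0 else 15_000
        train_world_model(world_model, dataset, steps=model_steps)

        # 3. Train the policy with PPO purely inside the learned model.
        train_ppo(policy, world_model)

    return policy
```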
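The "Experiment Setup" quotes scatter the key hyperparameters across several sentences; the dict below collects them in one place as an illustrative configuration sketch. The values are transcribed from the row above, but the key names are ours and do not correspond to actual Tensor2Tensor hyperparameter names.

```python
# Hyperparameters quoted in the "Experiment Setup" row, gathered into one
# illustrative dict. Key names are assumptions, not the library's own.

SIMPLE_HPARAMS = {
    # PPO policy training (step 6 of Algorithm 1)
    "ppo_gamma": 0.95,                    # discount factor
    "ppo_parallel_agents": 16,            # agents collecting simulated rollouts
    "ppo_rollout_steps": (25, 50, 100),   # simulated-rollout lengths used
    # PPO epochs per pass is z * 1000, with z = 1 everywhere except
    # z = 2 in passes 8 and 12 and z = 3 in the final pass.
    "ppo_epochs_multiplier": 1000,

    # Outer loop / world-model training schedule
    "main_loop_iterations": 15,
    "world_model_steps_first_iter": 45_000,
    "world_model_steps_later_iters": 15_000,

    # Environment preprocessing
    "frame_skip": 4,                      # each action repeated 4 times
    "frame_downscale_factor": 2,

    # World-model architecture and regularization
    "conv_layers": 4,
    "conv_filters": 64,
    "dense_layers": 3,
    "dropout": 0.2,
    "layer_norm": True,

    # Loss weighting on reconstructed frames
    "pixel_l2_loss_coef": 10.0,           # C for the L2 loss on pixel values
    "softmax_loss_coef": 0.03,            # C for the softmax loss
}
```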