SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments
Authors: Glen Berseth, Daniel Geng, Coline Manon Devin, Nicholas Rhinehart, Chelsea Finn, Dinesh Jayaraman, Sergey Levine
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our surprise minimizing agents can successfully play Tetris, Doom, control a humanoid to avoid falls, and navigate to escape enemies in a maze without any task-specific reward supervision. We further show that SMiRL can be used together with standard task rewards to accelerate reward-driven learning. |
| Researcher Affiliation | Academia | Glen Berseth (UC Berkeley); Daniel Geng (UC Berkeley); Coline Devin (UC Berkeley); Nicholas Rhinehart (UC Berkeley); Chelsea Finn (Stanford); Dinesh Jayaraman (University of Pennsylvania); Sergey Levine (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1 SMiRL (sketched after the table) |
| Open Source Code | No | The paper provides a link for "Video results" but does not state that the source code for the described methodology is available or provide a link for it. |
| Open Datasets | No | The paper mentions environments like "Tetris", "VizDoom" (citing Kempka et al., 2016), "Haunted House" (citing Chevalier-Boisvert et al., 2018), and "Simulated Humanoid robots" (citing Berseth et al., 2018), which are frameworks or simulators for running experiments, not pre-collected public datasets with explicit access information. |
| Dataset Splits | No | The paper describes training details such as episode length, replay buffer size, and sample collection, but it does not specify explicit training, validation, or test dataset splits (e.g., as percentages or counts from a static dataset) needed for reproduction. |
| Hardware Specification | No | The paper discusses the use of neural networks and reinforcement learning algorithms (DQN, TRPO) but does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific RL algorithms like DQN and TRPO, and toolkits like gym_minigrid, but it does not provide version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | For all environments trained with Double-DQN (Tetris, VizDoom, Haunted House) we use a fixed episode length of 500 for training and collect 1000 samples between training rounds that perform 1000 gradient steps on the network. The replay buffer size used is 50000. [...] For the Humanoid environments [...] The training collects 4098 samples at a time, performs 64 gradient steps on the value function and one step with TRPO. A fixed variance of 0.2 is used for the policy [...] A KL constraint of 0.2 is used for TRPO and a learning rate of 0.001 is used for training the value function. (These values are consolidated in the configuration sketch after the table.) |
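
For readers who want a concrete picture of the pseudocode referenced in the Pseudocode row, the following is a minimal Python sketch of the SMiRL loop under stated assumptions: an independent-Gaussian density model over flat observations, a Gymnasium-style `env`, and placeholder `policy` and `policy_update` callables. None of the names or implementation details below come from the authors' (unreleased) code; only the overall structure, in which the per-step reward is the log-likelihood of the new state under a density model fit to the states seen so far in the episode and the density parameters are appended to the observation, follows the paper's description.

```python
import numpy as np


class GaussianDensity:
    """Running diagonal-Gaussian estimate of the states seen so far in an episode.

    Assumption: an independent Gaussian per observation dimension, updated with
    Welford's algorithm; m2 is initialized to ones to avoid zero variance early on.
    """

    def __init__(self, obs_dim):
        self.n = 0
        self.mean = np.zeros(obs_dim)
        self.m2 = np.ones(obs_dim)  # running sum of squared deviations

    def update(self, s):
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def log_prob(self, s):
        var = self.m2 / max(self.n, 1) + 1e-6
        return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (s - self.mean) ** 2 / var))

    def params(self):
        # Density parameters are appended to the observation so the policy can
        # anticipate how its actions will change future surprise.
        return np.concatenate([self.mean, self.m2 / max(self.n, 1)])


def smirl_episode(env, policy, policy_update, max_steps=500):
    """One SMiRL episode: reward each step by the log-likelihood of the new
    state under a density model fit to the states seen so far this episode."""
    density = GaussianDensity(env.observation_space.shape[0])
    s, _ = env.reset()
    density.update(s)
    transitions = []
    for t in range(max_steps):
        aug_s = np.concatenate([s, density.params(), [t]])  # augmented observation
        a = policy(aug_s)
        s_next, _, terminated, truncated, _ = env.step(a)
        r = density.log_prob(s_next)  # surprise-minimizing reward
        density.update(s_next)
        aug_s_next = np.concatenate([s_next, density.params(), [t + 1]])
        transitions.append((aug_s, a, r, aug_s_next))
        s = s_next
        if terminated or truncated:
            break
    policy_update(transitions)  # e.g., Double-DQN or TRPO updates, as in the paper
    return transitions
```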
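
The experiment-setup details quoted in the table can be consolidated into a single configuration sketch. The numeric values below are taken from the quoted text; the dictionary structure and key names are illustrative assumptions, and settings the excerpt does not report (network sizes, exploration schedules, discount factors) are omitted.

```python
# Hyperparameters quoted from the paper's training details; key names are
# illustrative only and do not come from the authors' code.
SMIRL_CONFIGS = {
    "dqn_envs": {              # Tetris, VizDoom, Haunted House (Double-DQN)
        "episode_length": 500,
        "samples_per_round": 1000,
        "gradient_steps_per_round": 1000,
        "replay_buffer_size": 50_000,
    },
    "humanoid_envs": {         # simulated humanoid tasks (TRPO)
        "samples_per_batch": 4098,
        "value_gradient_steps": 64,
        "trpo_steps_per_batch": 1,
        "policy_fixed_variance": 0.2,
        "trpo_kl_constraint": 0.2,
        "value_learning_rate": 1e-3,
    },
}
```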