SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments
Authors: Glen Berseth, Daniel Geng, Coline Manon Devin, Nicholas Rhinehart, Chelsea Finn, Dinesh Jayaraman, Sergey Levine
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our surprise minimizing agents can successfully play Tetris, Doom, control a humanoid to avoid falls, and navigate to escape enemies in a maze without any task-specific reward supervision. We further show that SMiRL can be used together with standard task rewards to accelerate reward-driven learning. |
| Researcher Affiliation | Academia | Glen Berseth (UC Berkeley); Daniel Geng (UC Berkeley); Coline Devin (UC Berkeley); Nicholas Rhinehart (UC Berkeley); Chelsea Finn (Stanford); Dinesh Jayaraman (University of Pennsylvania); Sergey Levine (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1 SMiRL (sketched after the table) |
| Open Source Code | No | The paper provides a link for "Video results" but does not state that the source code for the described methodology is available or provide a link for it. |
| Open Datasets | No | The paper mentions environments like "Tetris", "VizDoom" (citing Kempka et al., 2016), "Haunted House" (citing Chevalier-Boisvert et al., 2018), and "Simulated Humanoid robots" (citing Berseth et al., 2018), which are frameworks or simulators for running experiments, not pre-collected public datasets with explicit access information. |
| Dataset Splits | No | The paper describes training details such as episode length, replay buffer size, and sample collection, but it does not specify explicit training, validation, or test dataset splits (e.g., as percentages or counts from a static dataset) needed for reproduction. |
| Hardware Specification | No | The paper discusses the use of neural networks and reinforcement learning algorithms (DQN, TRPO) but does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions specific RL algorithms like DQN and TRPO, and toolkits like gym_minigrid, but it does not provide version numbers for these software components or any other libraries used. |
| Experiment Setup | Yes | For all environments trained with Double-DQN (Tetris, VizDoom, Haunted House) we use a fixed episode length of 500 for training and collect 1000 samples between training rounds that perform 1000 gradient steps on the network. The replay buffer size used is 50000. [...] For the Humanoid environments [...] The training collects 4098 samples at a time, performs 64 gradient steps on the value function and one step with TRPO. A fixed variance of 0.2 is used for the policy [...] A KL constraint of 0.2 is used for TRPO and a learning rate of 0.001 is used for training the value function. (These values are consolidated in the configuration sketch after the table.) |
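
For readers who want a concrete picture of the pseudocode referenced in the Pseudocode row, the following is a minimal Python sketch of the SMiRL loop under stated assumptions: an independent-Gaussian density model over flat observations, a Gymnasium-style `env`, and placeholder `policy` and `policy_update` callables. None of the names or implementation details below come from the authors' (unreleased) code; only the overall structure, in which the per-step reward is the log-likelihood of the new state under a density model fit to the states seen so far in the episode and the density parameters are appended to the observation, follows the paper's description.

```python
import numpy as np


class GaussianDensity:
    """Running diagonal-Gaussian estimate of the states seen so far in an episode.

    Assumption: an independent Gaussian per observation dimension, updated with
    Welford's algorithm; m2 is initialized to ones to avoid zero variance early on.
    """

    def __init__(self, obs_dim):
        self.n = 0
        self.mean = np.zeros(obs_dim)
        self.m2 = np.ones(obs_dim)  # running sum of squared deviations

    def update(self, s):
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def log_prob(self, s):
        var = self.m2 / max(self.n, 1) + 1e-6
        return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (s - self.mean) ** 2 / var))

    def params(self):
        # Density parameters are appended to the observation so the policy can
        # anticipate how its actions will change future surprise.
        return np.concatenate([self.mean, self.m2 / max(self.n, 1)])


def smirl_episode(env, policy, policy_update, max_steps=500):
    """One SMiRL episode: reward each step by the log-likelihood of the new
    state under a density model fit to the states seen so far this episode."""
    density = GaussianDensity(env.observation_space.shape[0])
    s, _ = env.reset()
    density.update(s)
    transitions = []
    for t in range(max_steps):
        aug_s = np.concatenate([s, density.params(), [t]])  # augmented observation
        a = policy(aug_s)
        s_next, _, terminated, truncated, _ = env.step(a)
        r = density.log_prob(s_next)  # surprise-minimizing reward
        density.update(s_next)
        aug_s_next = np.concatenate([s_next, density.params(), [t + 1]])
        transitions.append((aug_s, a, r, aug_s_next))
        s = s_next
        if terminated or truncated:
            break
    policy_update(transitions)  # e.g., Double-DQN or TRPO updates, as in the paper
    return transitions
```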
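
The experiment-setup details quoted in the table can be consolidated into a single configuration sketch. The numeric values below are taken from the quoted text; the dictionary structure and key names are illustrative assumptions, and settings the excerpt does not report (network sizes, exploration schedules, discount factors) are omitted.

```python
# Hyperparameters quoted from the paper's training details; key names are
# illustrative only and do not come from the authors' code.
SMIRL_CONFIGS = {
    "dqn_envs": {              # Tetris, VizDoom, Haunted House (Double-DQN)
        "episode_length": 500,
        "samples_per_round": 1000,
        "gradient_steps_per_round": 1000,
        "replay_buffer_size": 50_000,
    },
    "humanoid_envs": {         # simulated humanoid tasks (TRPO)
        "samples_per_batch": 4098,
        "value_gradient_steps": 64,
        "trpo_steps_per_batch": 1,
        "policy_fixed_variance": 0.2,
        "trpo_kl_constraint": 0.2,
        "value_learning_rate": 1e-3,
    },
}
```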