Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Authors: Joshua Achiam, Shankar Sastry

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our empirical evaluations, we compare the performance of our proposed intrinsic rewards with other heuristic intrinsic reward schemes and to recent results from the literature. In particular, we compare to Variational Information Maximizing Exploration (VIME) [7], a method which approximately maximizes Bayesian surprise and currently achieves state-of-the-art performance on continuous control with sparse rewards. We show that our incentives can perform on the level of VIME at a lower computational cost.
Researcher Affiliation | Academia | Joshua Achiam & Shankar Sastry, Department of Electrical Engineering and Computer Science, UC Berkeley. jachiam@berkeley.edu, sastry@coe.berkeley.edu
Pseudocode | Yes | Algorithm 1: Reinforcement Learning with Surprise Incentive (an illustrative, hedged sketch of the surprise-augmented reward appears after this table).
Open Source Code | No | The paper mentions using "the open-source VIME code [6]" and "the rllab implementations of TRPO and the continuous control tasks [5]" for comparison and as base implementations, but does not state that the code for *their specific methodology* (surprise-based intrinsic motivation) is open-source or provided.
Open Datasets | Yes | Our continuous control tasks include the slate of sparse reward tasks introduced by Houthooft et al. [7]: sparse Mountain Car, sparse Cart Pole Swingup, and sparse Half Cheetah, as well as a new sparse reward task that we introduce here: sparse Swimmer. [...] The discrete action tasks are several games from the Atari RAM domain of the Open AI Gym [4]: Pong, Bank Heist, Freeway, and Venture. (A hedged example of loading the Atari RAM tasks appears after this table.)
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits. It discusses collecting "rollouts" and using a "replay memory" but does not define validation splits in a traditional dataset sense.
Hardware Specification | Yes | Tests were run on a Thinkpad T440p with four physical Intel i7-4700MQ cores, in the sparse Half Cheetah environment.
Software Dependencies | No | The paper mentions using "Trust Region Policy Optimization (TRPO) [20]", "rllab implementations", and "Open AI Gym [4]" but does not provide specific version numbers for these software components or any other libraries/environments.
Experiment Setup | Yes | For all tasks, the MDP discount factor γ was fixed to 0.995, and generalized advantage estimators (GAE) [21] were used, with the GAE λ parameter fixed to 0.95. In the table below, we show several other TRPO hyperparameters. Batch size refers to steps of experience collected at each iteration. The sub-sample factor is for the second-order optimization step, as detailed in Appendix A. (Table 1 of the paper provides specific values for Batch Size, Sub-Sample Factor, Max Rollout Length, and δKL for various environments; a hedged configuration sketch appears after this table.)
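
The Pseudocode row refers to the paper's Algorithm 1. As a concrete but non-authoritative illustration, the sketch below shows one way a surprisal bonus of the form r'(s,a,s') = r(s,a,s') + η·(−log P_φ(s'|s,a)) could be computed. The diagonal-Gaussian, linear-mean dynamics model, the fixed coefficient `eta`, and the toy data are simplifying assumptions; they are not the paper's learned transition model or its TRPO training loop.

```python
# Minimal sketch of a surprisal-style intrinsic bonus: r' = r + eta * (-log P_phi(s'|s,a)).
# Assumptions (not from the paper): a linear-mean diagonal-Gaussian dynamics model fit by
# least squares, a fixed bonus coefficient eta, and random toy transitions for shape-checking.
import numpy as np

class GaussianDynamicsModel:
    """Fits P(s'|s,a) = N(mean(s,a), diag(sigma^2)) with a linear mean in [s, a, 1]."""

    def fit(self, obs, acts, next_obs):
        X = np.hstack([obs, acts, np.ones((len(obs), 1))])
        self.W, *_ = np.linalg.lstsq(X, next_obs, rcond=None)
        resid = next_obs - X @ self.W
        self.log_sigma = 0.5 * np.log(resid.var(axis=0) + 1e-8)

    def log_prob(self, obs, acts, next_obs):
        X = np.hstack([obs, acts, np.ones((len(obs), 1))])
        mean, var = X @ self.W, np.exp(2 * self.log_sigma)
        # Sum of per-dimension Gaussian log densities.
        return -0.5 * (((next_obs - mean) ** 2) / var
                       + 2 * self.log_sigma + np.log(2 * np.pi)).sum(axis=1)

def surprise_augmented_rewards(rewards, model, obs, acts, next_obs, eta=1e-3):
    """Shaped reward: extrinsic reward plus eta times the surprisal -log P_phi(s'|s,a)."""
    return rewards + eta * (-model.log_prob(obs, acts, next_obs))

# Toy usage on random transitions (shapes only, no real environment):
obs, acts = np.random.randn(256, 4), np.random.randn(256, 2)
next_obs, rewards = np.random.randn(256, 4), np.zeros(256)
model = GaussianDynamicsModel()
model.fit(obs, acts, next_obs)
shaped = surprise_augmented_rewards(rewards, model, obs, acts, next_obs)
```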
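
For the Open Datasets row, the snippet below shows one plausible way of instantiating the listed Atari RAM games through the OpenAI Gym API. The `-ram-v0` environment ids are an assumption about the Gym release in use, and the sparse continuous-control tasks (sparse Mountain Car, Cart Pole Swingup, Half Cheetah, Swimmer) come from the rllab/VIME code rather than from `gym.make`.

```python
# Hedged example: loading the discrete-action Atari RAM tasks named above via OpenAI Gym.
# The "-ram-v0" id suffix is an assumption about the Gym version; newer releases use
# different ids and reset() signatures. The sparse continuous-control tasks are not created here.
import gym

ATARI_RAM_TASKS = ["Pong-ram-v0", "BankHeist-ram-v0", "Freeway-ram-v0", "Venture-ram-v0"]

for env_id in ATARI_RAM_TASKS:
    env = gym.make(env_id)
    obs = env.reset()  # 128-byte Atari RAM observation vector
    print(env_id, env.observation_space.shape, env.action_space.n)
    env.close()
```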
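
To accompany the Experiment Setup row, a small configuration sketch: `discount_gamma` and `gae_lambda` are the values quoted above, while the per-environment fields mirror the columns of the paper's Table 1 and are left as `None` placeholders because those numbers are not reproduced in this summary.

```python
# Sketch of the reported TRPO/GAE settings. Only gamma and lambda are quoted in the text
# above; the per-environment fields correspond to Table 1's columns and are placeholders.
TRPO_COMMON = {
    "discount_gamma": 0.995,  # MDP discount factor (quoted)
    "gae_lambda": 0.95,       # GAE lambda parameter (quoted)
}

PER_ENV_TABLE1_FIELDS = {
    "batch_size": None,          # steps of experience collected per iteration
    "sub_sample_factor": None,   # for the second-order optimization step (Appendix A)
    "max_rollout_length": None,  # maximum episode length
    "delta_kl": None,            # TRPO KL-divergence step-size constraint
}
```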