Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Authors: Joshua Achiam, Shankar Sastry

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our empirical evaluations, we compare the performance of our proposed intrinsic rewards with other heuristic intrinsic reward schemes and to recent results from the literature. In particular, we compare to Variational Information Maximizing Exploration (VIME) [7], a method which approximately maximizes Bayesian surprise and currently achieves state-of-the-art performance on continuous control with sparse rewards. We show that our incentives can perform on the level of VIME at a lower computational cost.
Researcher Affiliation | Academia | Joshua Achiam & Shankar Sastry, Department of Electrical Engineering and Computer Science, UC Berkeley. jachiam@berkeley.edu, sastry@coe.berkeley.edu
Pseudocode | Yes | Algorithm 1: Reinforcement Learning with Surprise Incentive (an illustrative, hedged sketch of the surprise-augmented reward appears after this table).
Open Source Code | No | The paper mentions using "the open-source VIME code [6]" and "the rllab implementations of TRPO and the continuous control tasks [5]" for comparison and as base implementations, but does not state that the code for *their specific methodology* (surprise-based intrinsic motivation) is open-source or provided.
Open Datasets | Yes | Our continuous control tasks include the slate of sparse reward tasks introduced by Houthooft et al. [7]: sparse Mountain Car, sparse Cart Pole Swingup, and sparse Half Cheetah, as well as a new sparse reward task that we introduce here: sparse Swimmer. [...] The discrete action tasks are several games from the Atari RAM domain of the Open AI Gym [4]: Pong, Bank Heist, Freeway, and Venture. (A hedged example of loading the Atari RAM tasks appears after this table.)
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits. It discusses collecting "rollouts" and using a "replay memory" but does not define validation splits in a traditional dataset sense.
Hardware Specification | Yes | Tests were run on a Thinkpad T440p with four physical Intel i7-4700MQ cores, in the sparse Half Cheetah environment.
Software Dependencies | No | The paper mentions using "Trust Region Policy Optimization (TRPO) [20]", "rllab implementations", and "Open AI Gym [4]" but does not provide specific version numbers for these software components or any other libraries/environments.
Experiment Setup | Yes | For all tasks, the MDP discount factor γ was fixed to 0.995, and generalized advantage estimators (GAE) [21] were used, with the GAE λ parameter fixed to 0.95. In the table below, we show several other TRPO hyperparameters. Batch size refers to steps of experience collected at each iteration. The sub-sample factor is for the second-order optimization step, as detailed in Appendix A. (Table 1 of the paper provides specific values for Batch Size, Sub-Sample Factor, Max Rollout Length, and δKL for various environments; a hedged configuration sketch appears after this table.)
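
The Pseudocode row refers to the paper's Algorithm 1. As a concrete but non-authoritative illustration, the sketch below shows one way a surprisal bonus of the form r'(s,a,s') = r(s,a,s') + η·(−log P_φ(s'|s,a)) could be computed. The diagonal-Gaussian, linear-mean dynamics model, the fixed coefficient `eta`, and the toy data are simplifying assumptions; they are not the paper's learned transition model or its TRPO training loop.

```python
# Minimal sketch of a surprisal-style intrinsic bonus: r' = r + eta * (-log P_phi(s'|s,a)).
# Assumptions (not from the paper): a linear-mean diagonal-Gaussian dynamics model fit by
# least squares, a fixed bonus coefficient eta, and random toy transitions for shape-checking.
import numpy as np

class GaussianDynamicsModel:
    """Fits P(s'|s,a) = N(mean(s,a), diag(sigma^2)) with a linear mean in [s, a, 1]."""

    def fit(self, obs, acts, next_obs):
        X = np.hstack([obs, acts, np.ones((len(obs), 1))])
        self.W, *_ = np.linalg.lstsq(X, next_obs, rcond=None)
        resid = next_obs - X @ self.W
        self.log_sigma = 0.5 * np.log(resid.var(axis=0) + 1e-8)

    def log_prob(self, obs, acts, next_obs):
        X = np.hstack([obs, acts, np.ones((len(obs), 1))])
        mean, var = X @ self.W, np.exp(2 * self.log_sigma)
        # Sum of per-dimension Gaussian log densities.
        return -0.5 * (((next_obs - mean) ** 2) / var
                       + 2 * self.log_sigma + np.log(2 * np.pi)).sum(axis=1)

def surprise_augmented_rewards(rewards, model, obs, acts, next_obs, eta=1e-3):
    """Shaped reward: extrinsic reward plus eta times the surprisal -log P_phi(s'|s,a)."""
    return rewards + eta * (-model.log_prob(obs, acts, next_obs))

# Toy usage on random transitions (shapes only, no real environment):
obs, acts = np.random.randn(256, 4), np.random.randn(256, 2)
next_obs, rewards = np.random.randn(256, 4), np.zeros(256)
model = GaussianDynamicsModel()
model.fit(obs, acts, next_obs)
shaped = surprise_augmented_rewards(rewards, model, obs, acts, next_obs)
```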
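
For the Open Datasets row, the snippet below shows one plausible way of instantiating the listed Atari RAM games through the OpenAI Gym API. The `-ram-v0` environment ids are an assumption about the Gym release in use, and the sparse continuous-control tasks (sparse Mountain Car, Cart Pole Swingup, Half Cheetah, Swimmer) come from the rllab/VIME code rather than from `gym.make`.

```python
# Hedged example: loading the discrete-action Atari RAM tasks named above via OpenAI Gym.
# The "-ram-v0" id suffix is an assumption about the Gym version; newer releases use
# different ids and reset() signatures. The sparse continuous-control tasks are not created here.
import gym

ATARI_RAM_TASKS = ["Pong-ram-v0", "BankHeist-ram-v0", "Freeway-ram-v0", "Venture-ram-v0"]

for env_id in ATARI_RAM_TASKS:
    env = gym.make(env_id)
    obs = env.reset()  # 128-byte Atari RAM observation vector
    print(env_id, env.observation_space.shape, env.action_space.n)
    env.close()
```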
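
To accompany the Experiment Setup row, a small configuration sketch: `discount_gamma` and `gae_lambda` are the values quoted above, while the per-environment fields mirror the columns of the paper's Table 1 and are left as `None` placeholders because those numbers are not reproduced in this summary.

```python
# Sketch of the reported TRPO/GAE settings. Only gamma and lambda are quoted in the text
# above; the per-environment fields correspond to Table 1's columns and are placeholders.
TRPO_COMMON = {
    "discount_gamma": 0.995,  # MDP discount factor (quoted)
    "gae_lambda": 0.95,       # GAE lambda parameter (quoted)
}

PER_ENV_TABLE1_FIELDS = {
    "batch_size": None,          # steps of experience collected per iteration
    "sub_sample_factor": None,   # for the second-order optimization step (Appendix A)
    "max_rollout_length": None,  # maximum episode length
    "delta_kl": None,            # TRPO KL-divergence step-size constraint
}
```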