Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning
Authors: Joshua Achiam, Shankar Sastry
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our empirical evaluations, we compare the performance of our proposed intrinsic rewards with other heuristic intrinsic reward schemes and to recent results from the literature. In particular, we compare to Variational Information Maximizing Exploration (VIME) [7], a method which approximately maximizes Bayesian surprise and currently achieves state-of-the-art performance on continuous control with sparse rewards. We show that our incentives can perform on the level of VIME at a lower computational cost. |
| Researcher Affiliation | Academia | Joshua Achiam & Shankar Sastry Department of Electrical Engineering and Computer Science UC Berkeley jachiam@berkeley.edu, sastry@coe.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Reinforcement Learning with Surprise Incentive (a hedged sketch of this reward-augmentation loop is given after the table) |
| Open Source Code | No | The paper mentions using "the open-source VIME code [6]" and "the rllab implementations of TRPO and the continuous control tasks [5]" for comparison and as base implementations, but does not state that the code for *their specific methodology* (surprise-based intrinsic motivation) is open-source or provided. |
| Open Datasets | Yes | Our continuous control tasks include the slate of sparse reward tasks introduced by Houthooft et al. [7]: sparse Mountain Car, sparse Cart Pole Swingup, and sparse Half Cheetah, as well as a new sparse reward task that we introduce here: sparse Swimmer. [...] The discrete action tasks are several games from the Atari RAM domain of the OpenAI Gym [4]: Pong, Bank Heist, Freeway, and Venture. (An environment-loading sketch for the Atari RAM tasks follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits. It discusses collecting "rollouts" and using a "replay memory" but does not define validation splits in a traditional dataset sense. |
| Hardware Specification | Yes | Tests were run on a Thinkpad T440p with four physical Intel i7-4700MQ cores, in the sparse Half Cheetah environment. |
| Software Dependencies | No | The paper mentions using "Trust Region Policy Optimization (TRPO) [20]", "rllab implementations", and "OpenAI Gym [4]" but does not provide specific version numbers for these software components or any other libraries/environments. |
| Experiment Setup | Yes | For all tasks, the MDP discount factor γ was fixed to 0.995, and generalized advantage estimators (GAE) [21] were used, with the GAE λ parameter fixed to 0.95. In the table below, we show several other TRPO hyperparameters. Batch size refers to steps of experience collected at each iteration. The sub-sample factor is for the second-order optimization step, as detailed in Appendix A. (Table 1 provides specific values for Batch Size, Sub-Sample, Max Rollout Length, and δKL for various environments; a short GAE computation example using these γ and λ values follows the table.) |
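
The pseudocode row refers to Algorithm 1 (Reinforcement Learning with Surprise Incentive), in which a transition model P_φ is fit by maximum likelihood on replay data and the surprisal −log P_φ(s′|s, a) is added to the extrinsic reward, scaled by a coefficient η, before the TRPO policy update. The sketch below only illustrates that reward augmentation: the toy Gaussian dynamics model, its class name, and all numeric values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


class GaussianDynamicsModel:
    """Toy stand-in for the learned transition model P_phi(s'|s, a).

    The paper fits a neural-network model on replay data; a fixed-variance
    Gaussian around a linear prediction keeps this sketch runnable. Every
    detail of this class is an illustrative assumption.
    """

    def __init__(self, obs_dim, act_dim, sigma=0.1):
        self.W = rng.normal(scale=0.1, size=(obs_dim + act_dim, obs_dim))
        self.sigma = sigma

    def log_prob(self, s, a, s_next):
        # log N(s_next; mean, sigma^2 I) for mean predicted from (s, a)
        mean = np.concatenate([s, a]) @ self.W
        diff = s_next - mean
        d = s_next.shape[0]
        return (-0.5 * np.sum(diff ** 2) / self.sigma ** 2
                - 0.5 * d * np.log(2 * np.pi * self.sigma ** 2))


def surprisal_bonus(model, s, a, s_next):
    """Intrinsic reward: the surprisal -log P_phi(s'|s, a)."""
    return -model.log_prob(s, a, s_next)


# Augment one transition's reward, r' = r + eta * surprisal.
obs_dim, act_dim, eta = 4, 2, 1e-3  # eta value chosen for illustration only
model = GaussianDynamicsModel(obs_dim, act_dim)
s = rng.normal(size=obs_dim)
a = rng.normal(size=act_dim)
s_next = rng.normal(size=obs_dim)
extrinsic_r = 0.0  # sparse-reward setting: the extrinsic reward is usually zero
augmented_r = extrinsic_r + eta * surprisal_bonus(model, s, a, s_next)
print(augmented_r)
```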
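
For the Atari RAM tasks listed in the Open Datasets row, a minimal loading sketch is given below. The exact environment IDs depend on the installed OpenAI Gym version; the "-ram-v0" names are assumptions based on Gym's naming convention at the time, not identifiers quoted from the paper, and the rllab sparse continuous-control tasks are not covered here.

```python
import gym

# Assumed environment IDs for the four Atari RAM games named in the paper.
ATARI_RAM_TASKS = ["Pong-ram-v0", "BankHeist-ram-v0", "Freeway-ram-v0", "Venture-ram-v0"]

for env_id in ATARI_RAM_TASKS:
    env = gym.make(env_id)
    obs = env.reset()  # in the RAM domain the observation is the 128-byte machine RAM
    print(env_id, env.observation_space.shape, env.action_space)
    env.close()
```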
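
The experiment setup fixes the discount γ = 0.995 and the GAE λ = 0.95. As a reference for those settings, the function below is a generic implementation of generalized advantage estimation, Â_t = Σ_l (γλ)^l δ_{t+l} with δ_t = r_t + γV(s_{t+1}) − V(s_t); it is not the authors' or rllab's code.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.995, lam=0.95):
    """Generalized advantage estimation for a single rollout.

    rewards: array of shape (T,), rewards r_0 .. r_{T-1}
    values:  array of shape (T+1,), value estimates V(s_0) .. V(s_T)
             (the last entry bootstraps the tail; use 0 for terminal states).
    Returns A_0 .. A_{T-1} with
        A_t = sum_l (gamma * lam)^l * delta_{t+l},
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy rollout using the paper's gamma = 0.995 and lambda = 0.95.
rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.3, 0.0])  # V(s_T) = 0 for a terminal state
print(gae_advantages(rewards, values))
```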