Adversarial Intrinsic Motivation for Reinforcement Learning

Authors: Ishan Durugkar, Mauricio Tec, Scott Niekum, Peter Stone

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent's exploration to find the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning. (See the first sketch after this table.)
Researcher Affiliation | Collaboration | Ishan Durugkar, Department of Computer Science, The University of Texas at Austin, Austin, TX, USA 78703, ishand@cs.utexas.edu; Mauricio Tec, Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX, USA 78703, mauriciogtec@utexas.edu; Scott Niekum, Department of Computer Science, The University of Texas at Austin, Austin, TX, USA 78703, sniekum@cs.utexas.edu; Peter Stone, Department of Computer Science, The University of Texas at Austin, Austin, TX, USA 78703, and Sony AI, pstone@cs.utexas.edu
Pseudocode | Yes | The basic procedure to learn and use adversarial intrinsic motivation (AIM) is laid out in Algorithm 1, which also includes how to use the algorithm in conjunction with HER. (See the relabelling sketch after this table.)
Open Source Code | No | The paper states "We used the HER implementation using Twin Delayed DDPG (TD3) [26] as the underlying RL algorithm from the stable baselines repository [38]" but does not provide a link to, or a statement about, its own open-source code for AIM.
Open Datasets | Yes | The Fetch robot tasks from OpenAI Gym [15], which have been used to evaluate learning of goal-conditioned policies previously [1, 80]. Descriptions of these tasks and their goal space are in Appendix H. We soften the Dirac target distribution for continuous states to instead be a Gaussian with variance of 0.01 of the range of each feature. (See the goal-softening sketch after this table.)
Dataset Splits | No | The paper mentions "We did an extensive sweep of the hyperparameters for the baseline HER + R (laid out in Appendix H), with a coarser search on relevant hyperparameters for AIM." This indicates hyperparameter tuning, but it does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, and testing.
Hardware Specification | No | The paper describes experiments in "simulated robotics tasks" and the "MuJoCo simulator", but does not provide any specific hardware details such as GPU or CPU models, memory, or cloud instance types.
Software Dependencies | No | We used the HER implementation using Twin Delayed DDPG (TD3) [26] as the underlying RL algorithm from the stable baselines repository [38]. While 'stable-baselines' is mentioned, no specific version number for it or other software dependencies is provided. (See the HER + TD3 setup sketch after this table.)
Experiment Setup | Yes | We did an extensive sweep of the hyperparameters for the baseline HER + R (laid out in Appendix H), with a coarser search on relevant hyperparameters for AIM.
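The sketches below illustrate a few of the items in the table; they are reconstructions under stated assumptions, not code released with the paper. First, the Research Type row quotes the claim that the AIM reward changes smoothly across transitions and directs exploration toward the goal. A minimal sketch of that idea, assuming the adversarially trained potential formulation the paper describes; the network sizes, penalty weight, and exact objective terms are assumptions here:

```python
# Minimal sketch of an adversarially trained potential yielding a smooth,
# goal-directed intrinsic reward in the spirit of AIM. Network sizes, the
# penalty weight, and the exact objective terms are illustrative assumptions.
import torch
import torch.nn as nn

class Potential(nn.Module):
    """f(s, g): higher values are intended to mean 'closer to the goal g'."""
    def __init__(self, state_dim, goal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1)).squeeze(-1)

def aim_discriminator_loss(f, s, s_next, goal_states, g, penalty_weight=10.0):
    """Push f up on (softened) goal states and down on visited states, while
    penalising transitions whose potential jumps by more than one step."""
    gap = f(goal_states, g).mean() - f(s_next, g).mean()
    step = f(s_next, g) - f(s, g)
    smoothness_penalty = torch.clamp(step.abs() - 1.0, min=0.0).pow(2).mean()
    return -(gap - penalty_weight * smoothness_penalty)

def aim_reward(f, s, s_next, g):
    """Intrinsic reward for a transition: the increase in potential."""
    with torch.no_grad():
        return f(s_next, g) - f(s, g)
```

Under this reading, the penalty discourages the potential from changing by more than roughly one unit per transition, which is one way to interpret the "changes smoothly with respect to transitions" claim.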
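The Pseudocode row notes that Algorithm 1 combines AIM with HER. The hindsight-relabelling step of that combination can be sketched as follows; the transition dictionary keys and the default number of sampled goals are assumptions in the spirit of HER, not a transcription of Algorithm 1:

```python
# Sketch of "future"-strategy hindsight relabelling, the HER component that
# Algorithm 1 combines with AIM. The transition dict keys and the default
# number of sampled goals are assumptions, not a transcription of the paper.
import random

def relabel_with_future_goals(episode, n_sampled_goal=4):
    """For each transition, add copies whose desired goal is the achieved goal
    of a state reached later in the same episode."""
    relabeled = []
    for t, transition in enumerate(episode):
        future = episode[t:]
        for _ in range(n_sampled_goal):
            later = random.choice(future)
            new = dict(transition)
            new["desired_goal"] = later["achieved_goal"]
            relabeled.append(new)
    return relabeled
```

In the full procedure, the relabelled transitions would be scored with the learned AIM reward rather than the environment's sparse reward.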
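The Open Datasets row quotes the softening of the Dirac goal target into a Gaussian with variance 0.01 of each feature's range. A sketch of one reading of that statement; the environment id (and the old-Gym reset signature) and the illustrative per-feature range are assumptions, since the paper does not say how the range was computed:

```python
# Sketch of one reading of the quoted goal-target softening: replace the Dirac
# at the goal with a Gaussian whose per-feature variance is 0.01 of that
# feature's range. The environment id and the illustrative range are assumptions.
import gym
import numpy as np

env = gym.make("FetchReach-v1")   # requires gym's MuJoCo robotics extras
obs = env.reset()                 # dict: observation / achieved_goal / desired_goal
goal = obs["desired_goal"]

def sample_softened_goal_states(goal, feature_range, n=64):
    """Sample target states around the goal instead of using a Dirac at it."""
    std = np.sqrt(0.01 * feature_range)      # variance = 0.01 * range
    return goal + np.random.randn(n, goal.shape[0]) * std

# The per-feature range would come from the task's goal space or observed data;
# the value below is purely illustrative.
targets = sample_softened_goal_states(goal, feature_range=np.full_like(goal, 0.3))
```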
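Finally, the Open Source Code and Software Dependencies rows point to the stable-baselines HER implementation with TD3. A minimal sketch of that baseline setup, assuming the stable-baselines HER wrapper API; the paper does not state a version, and AIM's learned reward would require an additional custom reward hook that is not shown here:

```python
# Sketch of the HER + TD3 baseline setup the paper says it built on, using the
# stable-baselines HER wrapper. No version is stated in the paper, and AIM's
# learned intrinsic reward is not included in this snippet.
import gym
from stable_baselines import HER, TD3

env = gym.make("FetchPush-v1")               # any goal-conditioned Fetch task
model = HER(
    "MlpPolicy",
    env,
    model_class=TD3,                         # underlying off-policy RL algorithm
    n_sampled_goal=4,                        # hindsight goals per real transition
    goal_selection_strategy="future",
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```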