Robust Adversarial Reinforcement Learning

Authors: Lerrel Pinto, James Davidson, Rahul Sukthankar, Abhinav Gupta

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper, Walker2d and Ant) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and (c) outperforms the baseline even in the absence of the adversary.
Researcher Affiliation | Collaboration | Carnegie Mellon University, Google Brain, Google Research. Correspondence to: Lerrel Pinto <lerrelp@cs.cmu.edu>.
Pseudocode | Yes | Algorithm 1 RARL (proposed algorithm); a hedged sketch of the alternating optimization it describes appears below this table.
Open Source Code | No | The paper does not provide any specific links to source code repositories, nor does it explicitly state that the code for the described methodology is being released or is publicly available.
Open Datasets | Yes | Our implementation of the adversarial environments builds on OpenAI Gym's (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator. (A minimal instantiation sketch appears below this table.)
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test dataset splits, nor does it refer to predefined splits with citations for reproducibility of data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments, only mentioning the use of the MuJoCo physics simulator.
Software Dependencies | No | The paper mentions software such as 'rllab', 'Trust Region Policy Optimization (TRPO)', 'OpenAI Gym', and 'MuJoCo', but it does not provide specific version numbers for these software components, which are required for a reproducible description of ancillary software.
Experiment Setup | Yes | For all the tasks and for both the protagonist and adversary, we use a policy network with two hidden layers with 64 neurons each. We train both RARL and the baseline for 100 iterations on InvertedPendulum and for 500 iterations on the other tasks. Hyperparameters of TRPO are selected by grid search. (An architecture sketch appears below this table.)
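
The Algorithm 1 noted in the Pseudocode row alternates policy updates between the protagonist and a destabilizing adversary in a zero-sum game. The sketch below is a hedged reconstruction of that alternating loop, not the authors' code; `rollout` and `policy_step` are hypothetical placeholders standing in for the rllab/TRPO sampling and update machinery used in the paper.

```python
# Hedged sketch of RARL's alternating optimization (Algorithm 1).
# `rollout` and `policy_step` are hypothetical stand-ins for the paper's
# rllab/TRPO trajectory sampling and policy-update routines.

def rarl_train(env, protagonist, adversary, rollout, policy_step,
               n_iters=500, n_mu=1, n_nu=1):
    """Alternate updates: the protagonist maximizes task reward while the
    adversary, applying destabilizing forces, tries to minimize it."""
    for _ in range(n_iters):
        # Phase 1: improve the protagonist against the current, frozen adversary.
        for _ in range(n_mu):
            trajectories = rollout(env, protagonist, adversary)
            protagonist = policy_step(protagonist, trajectories, maximize=True)
        # Phase 2: improve the adversary against the current, frozen protagonist.
        for _ in range(n_nu):
            trajectories = rollout(env, protagonist, adversary)
            adversary = policy_step(adversary, trajectories, maximize=False)
    return protagonist, adversary
```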
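The Open Datasets row points to the standard OpenAI Gym MuJoCo control tasks. The snippet below is a minimal sketch of instantiating the unmodified base environments; the "-v2" task suffixes are an assumption (the 2017 experiments used earlier Gym releases), and the adversarial force channels added by the authors are not reproduced here.

```python
import gym  # requires a MuJoCo-enabled Gym installation

# Base control tasks named in the paper; version suffixes are assumed.
TASKS = ["InvertedPendulum-v2", "HalfCheetah-v2", "Swimmer-v2",
         "Hopper-v2", "Walker2d-v2", "Ant-v2"]

for task in TASKS:
    env = gym.make(task)
    obs = env.reset()
    # One random step just to confirm the environment loads and runs.
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
```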
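The Experiment Setup row specifies a policy network with two hidden layers of 64 units for both players. The following is a minimal PyTorch sketch of that architecture; the original implementation used rllab, so the tanh activations and Gaussian output head are assumptions rather than details stated in the paper.

```python
import torch.nn as nn

def make_policy_net(obs_dim: int, act_dim: int) -> nn.Module:
    """Two hidden layers of 64 units each, as stated in the setup;
    the activation choice (tanh) is an assumption."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, act_dim),  # mean of a Gaussian action distribution
    )

# Training budget stated in the paper: 100 TRPO iterations for InvertedPendulum,
# 500 for every other task; TRPO hyperparameters were selected by grid search.
N_ITERATIONS = {"InvertedPendulum": 100, "HalfCheetah": 500, "Swimmer": 500,
                "Hopper": 500, "Walker2d": 500, "Ant": 500}
```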