Adversarial Policies: Attacking Deep Reinforcement Learning
Authors: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. |
| Researcher Affiliation | Academia | Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell; University of California, Berkeley |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured algorithm blocks. |
| Open Source Code | Yes | Videos and other supplementary material are available online at https://adversarialpolicies.github.io/ and our source code is available on GitHub at https://github.com/HumanCompatibleAI/adversarial-policies. |
| Open Datasets | Yes | We attack victim policies for the zero-sum simulated robotics games created by Bansal et al. (2018a), illustrated in Figure 2. The victims were trained in pairs via self-play against random old versions of their opponent, for between 680 and 1360 million timesteps. We use the pre-trained policy weights released in the agent zoo of Bansal et al. (2018b). |
| Dataset Splits | No | The paper mentions 'holding out Zoo*1V for validation' when describing the Gaussian Mixture Model analysis. However, it does not provide exact percentages or sample counts for this validation split, nor does it detail train/validation/test splits for the primary RL policy training. |
| Hardware Specification | Yes | It takes around 8 hours to train an adversary for a single victim using 4 cores of an Intel Xeon Platinum 8000 (Skylake) processor. |
| Software Dependencies | No | The paper mentions using the 'PPO implementation from Stable Baselines (Hill et al., 2019)' but does not give version numbers for Stable Baselines or the other software libraries, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | Table A.1 specifies the hyperparameters used for training. The number of environments was chosen for performance reasons after observing diminishing returns from using more than 8 parallel environments. The total timesteps was chosen by inspection after observing diminishing returns to additional training. The batch size, mini-batches, epochs per update, entropy coefficient and learning rate were tuned via a random search of 100 samples; see Section A in the appendix for details. |
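The sketches below make three rows of the table above concrete. First, the method summarized in the Research Type row reduces each two-player game to a single-agent RL problem: the victim policy is frozen and embedded in the environment, and the adversary is trained against it with PPO (the Stable Baselines implementation named in the Software Dependencies row). The wrapper below is a hypothetical stand-in, assuming a two-player environment whose `reset` and `step` return per-player tuples; it is not the authors' actual code.

```python
import gym
from stable_baselines import PPO2


class FixedVictimWrapper(gym.Env):
    """Hypothetical wrapper: exposes a two-player zero-sum env as a
    single-agent env by letting a frozen victim policy pick player 1's
    actions, so the adversary (player 0) can be trained with ordinary PPO."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env
        self.victim = victim_policy  # frozen; never updated during training
        # Assumes Tuple observation/action spaces indexed by player.
        self.observation_space = two_player_env.observation_space.spaces[0]
        self.action_space = two_player_env.action_space.spaces[0]
        self._victim_obs = None

    def reset(self):
        adv_obs, self._victim_obs = self.env.reset()
        return adv_obs

    def step(self, adv_action):
        # Assumed interface: victim_policy.predict(obs) -> action.
        victim_action = self.victim.predict(self._victim_obs)
        obs, rewards, done, info = self.env.step((adv_action, victim_action))
        adv_obs, self._victim_obs = obs
        return adv_obs, rewards[0], done, info


# Usage sketch (make_two_player_env and load_victim are hypothetical helpers):
# env = FixedVictimWrapper(make_two_player_env(), load_victim())
# model = PPO2("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=10_000_000)  # placeholder; see Table A.1 for the real budget
```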
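Second, the Dataset Splits row refers to a Gaussian Mixture Model fit to the victim network's activations, with one Zoo opponent held out for validation. A minimal sketch of that kind of density check with scikit-learn is below; the array shapes, the diagonal covariance, and the component count are assumptions rather than values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in activation matrices (n_timesteps, n_units) collected from the
# victim policy network: against a normal Zoo opponent (fit), against a
# held-out Zoo opponent (validation), and against the adversarial policy.
rng = np.random.default_rng(0)
fit_acts = rng.normal(size=(5000, 64))
heldout_acts = rng.normal(size=(1000, 64))
adv_acts = rng.normal(loc=0.5, size=(1000, 64))

gmm = GaussianMixture(n_components=20, covariance_type="diag", random_state=0)
gmm.fit(fit_acts)

# score() is the mean log-likelihood per sample; a large drop for the
# adversary-induced activations suggests they are off-distribution.
print("held-out normal opponent:", gmm.score(heldout_acts))
print("adversarial opponent:    ", gmm.score(adv_acts))
```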
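Finally, the Experiment Setup row notes that the batch size, number of mini-batches, epochs per update, entropy coefficient and learning rate were tuned via a random search of 100 samples. A minimal sketch of such a loop is below; the sampling ranges and the `evaluate_config` stand-in are assumptions, since the actual search space is given in Appendix A of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_config():
    """Draw one PPO hyperparameter configuration. Ranges are illustrative
    assumptions, not the search space from Appendix A of the paper."""
    return {
        "batch_size": int(rng.choice([2048, 4096, 8192, 16384])),
        "n_minibatches": int(rng.choice([4, 8, 16, 32])),
        "noptepochs": int(rng.choice([2, 4, 8])),
        "ent_coef": float(10 ** rng.uniform(-4, -1)),
        "learning_rate": float(10 ** rng.uniform(-5, -3)),
    }


def evaluate_config(config):
    """Stand-in for training an adversary with `config` for a short budget
    and measuring its win rate against the victim; here it just returns a
    random placeholder score."""
    return float(rng.random())


best_config, best_score = None, -np.inf
for _ in range(100):  # 100 samples, matching the paper's random search
    config = sample_config()
    score = evaluate_config(config)
    if score > best_score:
        best_config, best_score = config, score

print("best configuration found:", best_config)
```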