Adversarial Policies: Attacking Deep Reinforcement Learning
Authors: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent. |
| Researcher Affiliation | Academia | Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell; University of California, Berkeley |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured algorithm blocks. |
| Open Source Code | Yes | Videos and other supplementary material are available online at https://adversarialpolicies.github.io/ and our source code is available on GitHub at https://github.com/HumanCompatibleAI/adversarial-policies. |
| Open Datasets | Yes | We attack victim policies for the zero-sum simulated robotics games created by Bansal et al. (2018a), illustrated in Figure 2. The victims were trained in pairs via self-play against random old versions of their opponent, for between 680 and 1360 million timesteps. We use the pre-trained policy weights released in the agent zoo of Bansal et al. (2018b). |
| Dataset Splits | No | The paper mentions 'holding out Zoo*1V for validation' when describing the Gaussian Mixture Model analysis. However, it does not provide exact percentages or sample counts for this validation split, nor does it detail train/validation/test splits for the primary RL policy training. |
| Hardware Specification | Yes | It takes around 8 hours to train an adversary for a single victim using 4 cores of an Intel Xeon Platinum 8000 (Skylake) processor. |
| Software Dependencies | No | The paper mentions using the 'PPO implementation from Stable Baselines (Hill et al., 2019)' but does not give version numbers for Stable Baselines or the other software libraries, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | Table A.1 specifies the hyperparameters used for training. The number of environments was chosen for performance reasons after observing diminishing returns from using more than 8 parallel environments. The total timesteps was chosen by inspection after observing diminishing returns to additional training. The batch size, mini-batches, epochs per update, entropy coefficient and learning rate were tuned via a random search of 100 samples; see Section A in the appendix for details. |
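The sketches below make three rows of the table above concrete. First, the method summarized in the Research Type row reduces each two-player game to a single-agent RL problem: the victim policy is frozen and embedded in the environment, and the adversary is trained against it with PPO (the Stable Baselines implementation named in the Software Dependencies row). The wrapper below is a hypothetical stand-in, assuming a two-player environment whose `reset` and `step` return per-player tuples; it is not the authors' actual code.

```python
import gym
from stable_baselines import PPO2


class FixedVictimWrapper(gym.Env):
    """Hypothetical wrapper: exposes a two-player zero-sum env as a
    single-agent env by letting a frozen victim policy pick player 1's
    actions, so the adversary (player 0) can be trained with ordinary PPO."""

    def __init__(self, two_player_env, victim_policy):
        self.env = two_player_env
        self.victim = victim_policy  # frozen; never updated during training
        # Assumes Tuple observation/action spaces indexed by player.
        self.observation_space = two_player_env.observation_space.spaces[0]
        self.action_space = two_player_env.action_space.spaces[0]
        self._victim_obs = None

    def reset(self):
        adv_obs, self._victim_obs = self.env.reset()
        return adv_obs

    def step(self, adv_action):
        # Assumed interface: victim_policy.predict(obs) -> action.
        victim_action = self.victim.predict(self._victim_obs)
        obs, rewards, done, info = self.env.step((adv_action, victim_action))
        adv_obs, self._victim_obs = obs
        return adv_obs, rewards[0], done, info


# Usage sketch (make_two_player_env and load_victim are hypothetical helpers):
# env = FixedVictimWrapper(make_two_player_env(), load_victim())
# model = PPO2("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=10_000_000)  # placeholder; see Table A.1 for the real budget
```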
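Second, the Dataset Splits row refers to a Gaussian Mixture Model fit to the victim network's activations, with one Zoo opponent held out for validation. A minimal sketch of that kind of density check with scikit-learn is below; the array shapes, the diagonal covariance, and the component count are assumptions rather than values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in activation matrices (n_timesteps, n_units) collected from the
# victim policy network: against a normal Zoo opponent (fit), against a
# held-out Zoo opponent (validation), and against the adversarial policy.
rng = np.random.default_rng(0)
fit_acts = rng.normal(size=(5000, 64))
heldout_acts = rng.normal(size=(1000, 64))
adv_acts = rng.normal(loc=0.5, size=(1000, 64))

gmm = GaussianMixture(n_components=20, covariance_type="diag", random_state=0)
gmm.fit(fit_acts)

# score() is the mean log-likelihood per sample; a large drop for the
# adversary-induced activations suggests they are off-distribution.
print("held-out normal opponent:", gmm.score(heldout_acts))
print("adversarial opponent:    ", gmm.score(adv_acts))
```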
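Finally, the Experiment Setup row notes that the batch size, number of mini-batches, epochs per update, entropy coefficient and learning rate were tuned via a random search of 100 samples. A minimal sketch of such a loop is below; the sampling ranges and the `evaluate_config` stand-in are assumptions, since the actual search space is given in Appendix A of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_config():
    """Draw one PPO hyperparameter configuration. Ranges are illustrative
    assumptions, not the search space from Appendix A of the paper."""
    return {
        "batch_size": int(rng.choice([2048, 4096, 8192, 16384])),
        "n_minibatches": int(rng.choice([4, 8, 16, 32])),
        "noptepochs": int(rng.choice([2, 4, 8])),
        "ent_coef": float(10 ** rng.uniform(-4, -1)),
        "learning_rate": float(10 ** rng.uniform(-5, -3)),
    }


def evaluate_config(config):
    """Stand-in for training an adversary with `config` for a short budget
    and measuring its win rate against the victim; here it just returns a
    random placeholder score."""
    return float(rng.random())


best_config, best_score = None, -np.inf
for _ in range(100):  # 100 samples, matching the paper's random search
    config = sample_config()
    score = evaluate_config(config)
    if score > best_score:
        best_config, best_score = config, score

print("best configuration found:", best_config)
```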