Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations
Authors: Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, Cho-Jui Hsieh
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on 10 environments ranging from Atari games with discrete actions to complex control tasks in continuous action space. Our proposed method significantly improves robustness under strong white-box attacks on state observations, including two strong attacks we design, the robust Sarsa attack (RS attack) and maximal action difference attack (MAD attack). |
| Researcher Affiliation | Collaboration | Huan Zhang* (UCLA), Hongge Chen* (MIT), Chaowei Xiao (NVIDIA), Bo Li (UIUC), Mingyan Liu (University of Michigan), Duane Boning (MIT), Cho-Jui Hsieh (UCLA) |
| Pseudocode | Yes | We show the full SA-DDPG algorithm in Appendix G.1. ... We defer the details on solving RDQN(θ) and full SA-DQN algorithm to Appendix H. |
| Open Source Code | Yes | Our code is available at https://github.com/chenhongge/StateAdvDRL. |
| Open Datasets | Yes | We use the PPO implementation from [14], which conducted hyperparameter search and published the optimal hyperparameters for PPO on three Mujoco environments in OpenAI Gym [7]. ... We implement Double DQN [72] and Prioritized Experience Replay [58] on four Atari games. |
| Dataset Splits | No | The paper uses standard RL environments like Mujoco and Atari. While it mentions hyperparameter tuning (e.g., for κPPO and κDDPG), it does not specify explicit dataset splits (e.g., percentages or sample counts for training, validation, and test sets) in the way a supervised learning paper would. RL typically uses episodes/steps for training and evaluation without formal dataset partitions. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions using implementations like PPO from [14], DDPG from [61], Double DQN [72], Prioritized Experience Replay [58], and auto_LiRPA [83], but it does not specify concrete version numbers for these software dependencies (e.g., PyTorch 1.x, Python 3.x). |
| Experiment Setup | Yes | We use their optimal hyperparameters for PPO, and the same set of hyperparameters for SA-PPO without further tuning. ... SA-PPO has one additional regularization parameter, κPPO, for the regularizer RPPO, which is chosen in {0.003, 0.01, 0.03, 0.1, 0.3, 1.0}. ... We run Walker2d and Hopper 2 × 10^6 steps and Humanoid 1 × 10^7 steps to ensure convergence. ... We train Atari agents for 6 million frames for both vanilla DQN and SA-DQN. ... Detailed hyperparameters are in Appendix F. (A hedged sketch of how κPPO might enter the PPO objective follows the table.) |
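
To make the role of κPPO in the experiment setup concrete, below is a minimal sketch of how such a coefficient could weight a state-adversarial smoothness regularizer next to the standard PPO clipped surrogate loss. The `policy.log_prob` / `policy.action_mean` interface and the random-perturbation regularizer are assumptions for illustration only; the paper's RPPO term is computed by maximizing the policy divergence over a perturbation set (e.g., via SGLD or a convex relaxation), which this sketch does not reproduce.

```python
import torch

def sa_ppo_loss(policy, states, actions, old_log_probs, advantages,
                kappa_ppo=0.01, eps_clip=0.2, perturb_eps=0.05):
    """Hypothetical sketch: PPO clipped surrogate loss plus a
    state-smoothness regularizer weighted by kappa_ppo."""
    # Standard PPO clipped surrogate objective.
    log_probs = policy.log_prob(states, actions)          # assumed interface
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages,
    )
    ppo_loss = -surrogate.mean()

    # Smoothness regularizer: penalize the change in the policy's action mean
    # when the observed state is perturbed within an L_inf ball of radius
    # perturb_eps. Here the perturbation is sampled uniformly at random;
    # the paper instead maximizes this term over the ball, which this
    # simplified sketch does not do.
    noise = (torch.rand_like(states) * 2 - 1) * perturb_eps
    clean_mean = policy.action_mean(states)               # assumed interface
    perturbed_mean = policy.action_mean(states + noise)
    regularizer = ((clean_mean - perturbed_mean) ** 2).sum(dim=-1).mean()

    return ppo_loss + kappa_ppo * regularizer
```

In a hyperparameter search matching the row above, `kappa_ppo` would be swept over the grid {0.003, 0.01, 0.03, 0.1, 0.3, 1.0} quoted from the paper, with all other PPO hyperparameters held fixed.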