Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
Authors: Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Benjamin Eysenbach, Tuomas Sandholm, Furong Huang, Stephen Marcus McAleer
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on continuous control tasks demonstrate that, compared with prior methods, our approach achieves a higher degree of robustness to various types of attacks on different attack domains, both in settings with temporally-coupled perturbations and decoupled perturbations. Third, we provide empirical results that demonstrate the effectiveness of our approach in defending against both temporally-coupled and non-temporally coupled adversaries on various attack domains. |
| Researcher Affiliation | Academia | University of Maryland, College Park; Carnegie Mellon University; Princeton University |
| Pseudocode | Yes | Algorithm 1: Policy Space Response Oracles (Lanctot et al., 2017); Algorithm 2: Game-theoretic Response approach for Adversarial Defense (GRAD); Algorithm 3: Action Adversary (AC-AD); Algorithm 4: Policy Adversarial Actor Director (PA-AD); Algorithm 5: Mixed Adversary. A minimal sketch of the PSRO-style training loop appears after the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | Our experiments are conducted on five various and challenging MuJoCo environments: Hopper, Walker2d, Halfcheetah, Ant, and Humanoid, all using the v2 version of MuJoCo. |
| Dataset Splits | No | The paper describes using MuJoCo environments for experiments, where data is generated through interaction, but does not specify explicit training/validation/test dataset splits or percentages for reproducing data partitioning. |
| Hardware Specification | Yes | Typically, on a single V100 GPU, training GRAD takes around 20 hours for environments like Hopper, Walker2d, and Halfcheetah. |
| Software Dependencies | No | The paper mentions 'Proximal Policy Optimization (PPO) algorithm' and 'Ray rllib' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Our experiments are conducted on five varied and challenging MuJoCo environments: Hopper, Walker2d, Halfcheetah, Ant, and Humanoid, all using the v2 version of MuJoCo. We use the Proximal Policy Optimization (PPO) algorithm as the policy optimizer for GRAD training. For the attack constraint ϵ, we use the commonly adopted value of ϵ for each environment, and we set the temporally-coupled constraint ϵ̄ = ϵ/5 (with minor adjustments in some environments). Network structure: GRAD adopts the same PPO network structure as the ATLA baselines to maintain consistency. The network comprises a single-layer LSTM with 64 hidden neurons, an input embedding layer that projects the state dimension to 64, and an output layer that projects 64 to the output dimension (see the network sketch after the table). |
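The pseudocode row lists a PSRO loop (Algorithm 1) and the GRAD defense built on it (Algorithm 2). The following is a minimal sketch, under stated assumptions, of how such a PSRO-style agent/adversary loop fits together; every function name here (`train_agent_br`, `train_adversary_br`, `estimate_payoff`, `solve_meta_game`) is a hypothetical placeholder, not the authors' released code, and the meta-solver stub returns a uniform mixture rather than an actual Nash solution.

```python
# Hypothetical sketch of a PSRO-style robust-training loop in the spirit of GRAD.
# Placeholders stand in for PPO training, MuJoCo rollouts, and the meta-Nash solver.
import numpy as np

def train_agent_br(adversary_mixture):
    """Placeholder: train a PPO agent as a best response to the adversary mixture."""
    return {"kind": "agent", "vs": adversary_mixture}

def train_adversary_br(agent_mixture):
    """Placeholder: train a (temporally-coupled) adversary as a best response."""
    return {"kind": "adversary", "vs": agent_mixture}

def estimate_payoff(agent, adversary):
    """Placeholder: roll out agent vs. adversary and return the average episode return."""
    return float(np.random.randn())

def solve_meta_game(payoffs):
    """Placeholder meta-solver: uniform mixtures stand in for a Nash equilibrium solver."""
    n_agents, n_advs = payoffs.shape
    return np.ones(n_agents) / n_agents, np.ones(n_advs) / n_advs

agents = [train_agent_br(None)]
adversaries = [train_adversary_br(None)]
agent_mix, adv_mix = np.array([1.0]), np.array([1.0])

for epoch in range(5):  # PSRO epochs
    # 1. Expand each population with a best response to the opponent's current mixture.
    agents.append(train_agent_br((adversaries, adv_mix)))
    adversaries.append(train_adversary_br((agents, agent_mix)))
    # 2. Fill in the empirical payoff matrix by simulating every agent/adversary pair.
    payoffs = np.array([[estimate_payoff(a, b) for b in adversaries] for a in agents])
    # 3. Re-solve the meta-game to obtain new mixtures over both populations.
    agent_mix, adv_mix = solve_meta_game(payoffs)
```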
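The experiment-setup row describes the policy network as an input embedding to 64 units, a single-layer LSTM with 64 hidden neurons, and an output projection. Below is a minimal PyTorch sketch of that structure; the concrete dimensions (`state_dim=11`, `action_dim=3`, roughly Hopper-sized) and the `tanh` activation are assumptions for illustration, not values confirmed by the paper.

```python
# Minimal sketch of the described PPO policy network: state embedding -> 64,
# single-layer LSTM with 64 hidden units, and a 64 -> action_dim output head.
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden)  # project state dimension to 64
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, action_dim)  # project 64 to the output dimension

    def forward(self, states, hidden_state=None):
        # states: (batch, time, state_dim); hidden_state carries (h, c) across chunks
        x = torch.tanh(self.embed(states))  # activation choice is an assumption
        x, hidden_state = self.lstm(x, hidden_state)
        return self.head(x), hidden_state

# Usage example with assumed Hopper-like dimensions.
policy = LSTMPolicy(state_dim=11, action_dim=3)
actions, _ = policy(torch.randn(1, 8, 11))  # one trajectory of 8 timesteps
```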