Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

Authors: Yongyuan Liang, Yanchao Sun, Ruijie Zheng, Xiangyu Liu, Benjamin Eysenbach, Tuomas Sandholm, Furong Huang, Stephen Marcus McAleer

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on continuous control tasks demonstrate that, compared with prior methods, our approach achieves a higher degree of robustness to various types of attacks on different attack domains, both in settings with temporally-coupled perturbations and decoupled perturbations." "Third, we provide empirical results that demonstrate the effectiveness of our approach in defending against both temporally-coupled and non-temporally coupled adversaries on various attack domains."
Researcher Affiliation | Academia | University of Maryland, College Park; Carnegie Mellon University; Princeton University
Pseudocode | Yes | Algorithm 1: Policy Space Response Oracles (Lanctot et al., 2017); Algorithm 2: Game-theoretic Response approach for Adversarial Defense (GRAD); Algorithm 3: Action Adversary (AC-AD); Algorithm 4: Policy Adversarial Actor Director (PA-AD); Algorithm 5: Mixed Adversary. (A toy sketch of the PSRO loop appears below the table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology, nor a link to a code repository.
Open Datasets | Yes | "Our experiments are conducted on five various and challenging MuJoCo environments: Hopper, Walker2d, Halfcheetah, Ant, and Humanoid, all using the v2 version of MuJoCo."
Dataset Splits | No | The paper's experiments use MuJoCo environments, where data is generated through environment interaction, but it does not specify explicit training/validation/test splits or percentages for reproducing a data partition.
Hardware Specification | Yes | "Typically, on a single V100 GPU, training GRAD takes around 20 hours for environments like Hopper, Walker2d, and Halfcheetah."
Software Dependencies | No | The paper mentions the "Proximal Policy Optimization (PPO) algorithm" and "Ray rllib" but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "Our experiments are conducted on five various and challenging MuJoCo environments: Hopper, Walker2d, Halfcheetah, Ant, and Humanoid, all using the v2 version of MuJoCo. We use the Proximal Policy Optimization (PPO) algorithm as the policy optimizer for GRAD training. For the attack constraint ϵ, we use the commonly adopted values of ϵ for each environment. We set the temporally-coupled constraint ϵ̄ = ϵ/5 (with minor adjustments in some environments)." Network structure: "Our algorithm (GRAD) adopts the same PPO network structure as the ATLA baselines to maintain consistency. The network comprises a single-layer LSTM with 64 hidden neurons. Additionally, an input embedding layer is employed to project the state dimension to 64, and an output layer is used to project 64 to the output dimension." (Sketches of this network and of the temporally-coupled budget appear below the table.)
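
The Pseudocode row lists a PSRO-based training loop (Algorithms 1 and 2). Below is a minimal, self-contained sketch of that double-oracle structure on a toy zero-sum matrix game; the matrix-game stand-in, and the names solve_zero_sum and psro_matrix_game, are illustrative assumptions. In GRAD itself each "strategy" is an RL policy or a temporally-coupled adversary, and each best response is trained with PPO rather than read off a payoff matrix.

```python
import numpy as np
from scipy.optimize import linprog


def solve_zero_sum(payoff):
    """Maximin (Nash) mixed strategy for the row player of a zero-sum matrix game."""
    n_rows, n_cols = payoff.shape
    # Decision variables: z = [x_1, ..., x_n_rows, v]; maximize v == minimize -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # Require v <= (payoff^T x)_j for every opponent column j:  -payoff^T x + v <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one; v is unbounded.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_rows], -res.fun


def psro_matrix_game(full_game, iterations=10):
    """Toy PSRO / double-oracle loop on a fixed zero-sum matrix game."""
    agent_pop, adv_pop = [0], [0]  # seed each population with one strategy
    for _ in range(iterations):
        # 1) Solve the restricted meta-game over the current populations.
        meta = full_game[np.ix_(agent_pop, adv_pop)]
        agent_mix, _ = solve_zero_sum(meta)        # agent's meta-Nash mixture
        adv_mix, _ = solve_zero_sum(-meta.T)       # adversary's meta-Nash mixture
        # 2) Oracle step: best pure response to the opponent's mixture.
        agent_br = int(np.argmax(full_game[:, adv_pop] @ adv_mix))
        adv_br = int(np.argmin(agent_mix @ full_game[agent_pop, :]))
        # 3) Expand the populations; stop once neither side finds anything new.
        if agent_br in agent_pop and adv_br in adv_pop:
            break
        if agent_br not in agent_pop:
            agent_pop.append(agent_br)
        if adv_br not in adv_pop:
            adv_pop.append(adv_br)
    # Final meta-Nash mixtures over the populations found so far.
    meta = full_game[np.ix_(agent_pop, adv_pop)]
    agent_mix, _ = solve_zero_sum(meta)
    adv_mix, _ = solve_zero_sum(-meta.T)
    return agent_pop, adv_pop, agent_mix, adv_mix


# Usage on a random 20x20 game standing in for agent-vs-adversary returns.
rng = np.random.default_rng(0)
agent_pop, adv_pop, agent_mix, adv_mix = psro_matrix_game(rng.normal(size=(20, 20)))
```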
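
The Experiment Setup row pins down the recurrent PPO architecture: an input embedding to 64 units, a single-layer LSTM with 64 hidden neurons, and an output projection. The PyTorch sketch below renders those shapes; the class name, the tanh activation, and the choice of output head (e.g., Gaussian policy mean vs. value) are assumptions, since the paper only specifies the layer sizes.

```python
import torch
import torch.nn as nn


class RecurrentPolicy(nn.Module):
    """state -> 64-d embedding -> single-layer LSTM(64) -> output projection."""

    def __init__(self, state_dim: int, out_dim: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden)   # input embedding layer
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)      # output projection

    def forward(self, states, hx=None):
        # states: (batch, time, state_dim); hx: optional (h0, c0) LSTM state.
        z = torch.tanh(self.embed(states))  # activation is an assumption, not stated in the paper
        z, hx = self.lstm(z, hx)
        return self.head(z), hx


# Example shapes for Hopper-v2 (11-dim observations, 3-dim actions for a policy-mean head).
policy = RecurrentPolicy(state_dim=11, out_dim=3)
out, hx = policy(torch.zeros(1, 8, 11))  # one trajectory of 8 time steps
```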
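
The setup also fixes the temporally-coupled budget at ϵ̄ = ϵ/5. Under a temporally-coupled threat model of this kind, the perturbation at step t must stay inside the usual ℓ∞ ball of radius ϵ and may move at most ϵ̄ per dimension from the previous step's perturbation. The NumPy sketch below is one way to enforce both boxes; the function name, the clipping order, and the example ϵ value are assumptions rather than details taken from the paper.

```python
import numpy as np


def project_temporally_coupled(delta_t, delta_prev, eps, eps_bar):
    """Clip a proposed perturbation so that ||delta_t||_inf <= eps and
    ||delta_t - delta_prev||_inf <= eps_bar (the temporally-coupled budget)."""
    # Keep the step-to-step change within eps_bar of the previous perturbation...
    delta_t = np.clip(delta_t, delta_prev - eps_bar, delta_prev + eps_bar)
    # ...then keep the absolute perturbation within the usual eps ball.
    return np.clip(delta_t, -eps, eps)


# Example with the paper's ratio eps_bar = eps / 5 (the eps value here is illustrative).
eps = 0.075
eps_bar = eps / 5
delta = project_temporally_coupled(np.array([0.1, -0.02]), np.zeros(2), eps, eps_bar)
```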