Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

Authors: Xiangyu Liu, Chenghao Deng, Yanchao Sun, Yongyuan Liang, Furong Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical validation on the Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios. ... (3) Empirical investigations. Through empirical studies on Mujoco, we validate the effectiveness of PROTECTED, demonstrating both improved natural performance and robustness, as well as adaptability against unknown and dynamic attacks.
Researcher Affiliation | Collaboration | Xiangyu Liu¹, Chenghao Deng¹, Yanchao Sun², Yongyuan Liang¹, Furong Huang¹; ¹University of Maryland, College Park; ²J.P. Morgan AI Research
Pseudocode | Yes | Algorithm 1: Online adaptation with refined policy class ... Algorithm 2: Iterative discovery of non-dominated policy class (see the illustrative adaptation sketch after this table)
Open Source Code | Yes | Codes are available at https://github.com/umd-huang-lab/PROTECTED.git
Open Datasets | Yes | For empirical studies, we implement our framework in four Mujoco environments with continuous action spaces, specifically, Hopper, Walker2d, Halfcheetah, and Ant, adhering to a setup similar to most related works (Zhang et al., 2020a; 2021; Sun et al., 2019; Liang et al., 2022).
Dataset Splits | No | The paper mentions using Mujoco environments and details training steps and evaluation over a number of episodes, but it does not specify explicit train/validation/test dataset splits with percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 2080 Ti GPU.
Software Dependencies | No | The paper mentions 'PPO' as an optimizer and 'Mujoco environments' but does not specify software names with version numbers for libraries, frameworks, or operating systems (e.g., Python version, PyTorch/TensorFlow version, Mujoco version).
Experiment Setup | Yes | For the network structure, we employ a single-layer LSTM with 64 hidden neurons in Ant and Halfcheetah, and the original fully connected MLP structure in the other two environments. Both the victims and the attackers are trained with independent value and policy optimizers by PPO. ... For the first policy π1 in the refined policy class Π̃, we train for 5 million steps (2441 iterations) in Ant and 2.5 million steps (1220 iterations) in the other three environments. ... We conduct a grid search of the optimal hyperparameters (including learning rates for the policy network and the adversary policy network, the ratio clip for PPO, and the entropy regularization) for each victim training method. (See the LSTM policy sketch after this table.)
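The Pseudocode row names Algorithm 1 ("Online adaptation with refined policy class"), whose details are not reproduced on this page. For orientation only, the sketch below shows a generic EXP3-style multiplicative-weights selection over a fixed set of candidate policies, which is one standard way to adapt among pre-trained policies online; the class name OnlineAdaptation, the learning rate, and the run_episode hook are illustrative assumptions and not the authors' exact procedure.

```python
import numpy as np

class OnlineAdaptation:
    """EXP3-style multiplicative-weights selection over a fixed policy class.

    Illustrative only: the paper's Algorithm 1 adapts within a refined
    non-dominated policy class, and its exact update rule is not shown here.
    """

    def __init__(self, num_policies, lr=0.1):
        self.lr = lr
        self.log_weights = np.zeros(num_policies)

    def sample_policy(self, rng):
        # Sample a candidate policy index in proportion to its current weight.
        w = np.exp(self.log_weights - self.log_weights.max())
        probs = w / w.sum()
        return rng.choice(len(probs), p=probs), probs

    def update(self, idx, reward, probs):
        # Importance-weighted reward estimate for the selected policy only,
        # as in adversarial bandits; reward assumed normalized to [0, 1].
        reward_hat = reward / max(probs[idx], 1e-8)
        self.log_weights[idx] += self.lr * reward_hat


# Usage sketch: run_episode(i) would roll out candidate policy i against the
# (possibly adaptive) attacker and return its normalized episode return.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    adapter = OnlineAdaptation(num_policies=4)
    for episode in range(100):
        idx, probs = adapter.sample_policy(rng)
        ret = rng.uniform()  # stand-in for run_episode(idx)
        adapter.update(idx, ret, probs)
```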
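The Experiment Setup row reports a single-layer LSTM with 64 hidden neurons for Ant and Halfcheetah, trained with PPO. Below is a minimal PyTorch sketch of such a recurrent Gaussian actor; the observation/action dimensions, the diagonal-Gaussian head, and the class name LSTMPolicy are assumptions for illustration and are not taken from the paper or its released code, and the PPO training loop itself is omitted.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Single-layer LSTM actor with 64 hidden units, mirroring the reported
    architecture for Ant/Halfcheetah. Sizes and the Gaussian head are
    placeholders, not values taken from the paper."""

    def __init__(self, obs_dim, act_dim, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, num_layers=1, batch_first=True)
        self.mean_head = nn.Linear(hidden_size, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries the recurrent state
        # across rollout chunks during PPO updates.
        out, hidden = self.lstm(obs_seq, hidden)
        mean = self.mean_head(out)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std), hidden


# Example: one forward pass on a dummy observation sequence (dimensions are
# illustrative, not the exact Ant observation/action sizes used in the paper).
policy = LSTMPolicy(obs_dim=111, act_dim=8)
dist, _ = policy(torch.zeros(1, 16, 111))
action = dist.sample()
```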