Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
Authors: Xiangyu Liu, Chenghao Deng, Yanchao Sun, Yongyuan Liang, Furong Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical validation on the Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios. ... (3) Empirical investigations. Through empirical studies on Mujoco, we validate the effectiveness of PROTECTED, demonstrating both improved natural performance and robustness, as well as adaptability against unknown and dynamic attacks. |
| Researcher Affiliation | Collaboration | Xiangyu Liu¹, Chenghao Deng¹, Yanchao Sun², Yongyuan Liang¹, Furong Huang¹ (¹University of Maryland, College Park; ²J.P. Morgan AI Research) |
| Pseudocode | Yes | Algorithm 1 Online adaptation with refined policy class... Algorithm 2 Iterative discovery of non-dominated policy class (a hedged sketch of the adaptation loop is given below the table) |
| Open Source Code | Yes | Codes are available at https://github.com/umd-huang-lab/PROTECTED.git |
| Open Datasets | Yes | For empirical studies, we implement our framework in four Mujoco environments with continuous action spaces, specifically, Hopper, Walker2d, Halfcheetah, and Ant, adhering to a setup similar to most related works (Zhang et al., 2020a; 2021; Sun et al., 2019; Liang et al., 2022). |
| Dataset Splits | No | The paper mentions using Mujoco environments and details training steps and evaluation over a number of episodes, but it does not specify explicit train/validation/test dataset splits with percentages, sample counts, or references to predefined splits. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions training with PPO and the Mujoco environments, but it does not specify software names with version numbers for libraries, frameworks, or operating systems (e.g., Python version, PyTorch/TensorFlow version, Mujoco version). |
| Experiment Setup | Yes | For the network structure, we employ a single-layer LSTM with 64 hidden neurons in Ant and Halfcheetah, and the original fully connected MLP structure in the other two environments. Both the victims and the attackers are trained with independent value and policy optimizers by PPO. ... For the first policy π1 in the refined policy class Π̃, we train for 5 million steps (2441 iterations) in Ant and 2.5 million steps (1220 iterations) in the other three environments. ... We conduct a grid search of the optimal hyperparameters (including learning rates for the policy network and the adversary policy network, the ratio clip for PPO, and the entropy regularization) for each victim training method. (An illustrative sketch of the recurrent policy is given below the table.) |
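The Pseudocode row above cites Algorithm 1 (online adaptation with a refined policy class) and Algorithm 2 (iterative discovery of non-dominated policies). This report does not reproduce the authors' updates, so the following is only a minimal sketch of an EXP3-style no-regret selection loop over a fixed set of pretrained victim policies, which is the general pattern an online-adaptation algorithm of this kind follows; the names (`policies`, `run_episode`) and the exploration/learning-rate choices are assumptions, not the paper's code.

```python
import numpy as np

def exp3_adaptation(policies, run_episode, num_episodes, gamma=0.1):
    """EXP3-style online adaptation over a fixed set of pretrained victim policies.

    policies    : list of candidate policies (e.g., a discovered non-dominated set)
    run_episode : callable(policy) -> episode return in [0, 1] under the current, unknown attacker
    gamma       : exploration rate of the EXP3 update
    """
    k = len(policies)
    weights = np.ones(k)
    for _ in range(num_episodes):
        probs = (1.0 - gamma) * weights / weights.sum() + gamma / k
        i = np.random.choice(k, p=probs)            # deploy one candidate policy this episode
        reward = float(np.clip(run_episode(policies[i]), 0.0, 1.0))
        estimate = reward / probs[i]                # importance-weighted (bandit) reward estimate
        weights[i] *= np.exp(gamma * estimate / k)  # multiplicative-weights update
    return weights / weights.sum()                  # final mixture over the policy class
```

In practice, `run_episode` would roll out the chosen policy in the Mujoco environment while the (possibly adaptive) attacker perturbs its observations, and episode returns would need rescaling into [0, 1] before the update.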
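The Experiment Setup row quotes a single-layer LSTM with 64 hidden neurons as the policy backbone for Ant and Halfcheetah, trained with PPO. Below is a minimal PyTorch sketch of such a recurrent Gaussian policy; the hidden size comes from the quoted setup, while the action distribution, the state-independent log-std, and the example dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Single-layer LSTM policy head with 64 hidden units (size per the quoted setup)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, num_layers=1, batch_first=True)
        self.mean = nn.Linear(hidden_size, act_dim)         # Gaussian mean for continuous actions
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std (assumed)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries the recurrent state across steps
        out, hidden = self.lstm(obs_seq, hidden)
        dist = torch.distributions.Normal(self.mean(out), self.log_std.exp())
        return dist, hidden

# Example: sample an action for a single observation (dimensions are illustrative only)
policy = LSTMPolicy(obs_dim=17, act_dim=6)
dist, h = policy(torch.randn(1, 1, 17))
action = dist.sample()
```

A PPO implementation would carry `hidden` across timesteps within a rollout and reset it at episode boundaries; per the quoted setup, the value function and the policy use independent optimizers.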