Robust Reinforcement Learning on State Observations with Learned Optimal Adversary
Authors: Huan Zhang, Hongge Chen, Duane S Boning, Cho-Jui Hsieh
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. |
| Researcher Affiliation | Academia | Department of Computer Science, UCLA; Department of EECS, MIT |
| Pseudocode | Yes | Algorithm 1: Learning an optimal adversary for perturbations on state observations (a minimal code sketch of this procedure is given after the table). |
| Open Source Code | Yes | Our code is available at https://github.com/huanzhang12/ATLA_robust_RL. |
| Open Datasets | Yes | In this section, we use PPO to train an adversary on four OpenAI Gym MuJoCo continuous control environments. |
| Dataset Splits | No | No explicit training/validation/test dataset splits were provided, as is common in continuous control reinforcement learning, where data is generated through environment interaction rather than drawn from a static dataset. The paper describes training agents for a fixed number of steps, evaluating performance over 50 episodes, and selecting agents based on median robustness from 21 runs. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'OpenAI Gym MuJoCo continuous control environments' and 'PPO as our policy optimizer' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Hyperparameters for ATLA-PPO: For ATLA-PPO, we have hyperparameters for both agent and adversary. We keep all agent hyperparameters the same as those in vanilla MLP/LSTM agents, except for the entropy bonus coefficient. We find that sometimes we need a larger entropy bonus coefficient in ATLA to allow sufficient exploration of the agent, as learning with an adversary is harder than learning in attack-free environments. For the adversary, we run a small-scale hyperparameter search on the learning rate of adversary policy and value networks, and the entropy bonus coefficient for the adversary. (A hedged sketch of this hyperparameter structure is given after the table.) |
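
As a concrete reading of the Algorithm 1 referenced in the Pseudocode row, the sketch below wraps a Gym environment so that the adversary becomes the learning agent: it observes the true state, emits an ℓ∞-bounded perturbation, the fixed victim policy acts on the perturbed observation, and the adversary receives the negated reward. This is a minimal illustration only; the wrapper class, the `victim_policy.act(obs)` interface, and the use of the classic (pre-0.26) Gym API are assumptions, not the authors' implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of learning an optimal adversary on state observations
# (Algorithm 1), assuming the classic (pre-0.26) Gym API and a fixed victim
# policy exposing `act(obs) -> action`. All names here are illustrative.
import gym
import numpy as np


class AdversaryEnv(gym.Env):
    """Recasts the environment so that the adversary is the RL agent being trained."""

    def __init__(self, env, victim_policy, epsilon):
        self.env = env
        self.victim = victim_policy
        self.epsilon = epsilon
        self.observation_space = env.observation_space
        # Adversary action: a perturbation inside the l_inf ball of radius epsilon.
        shape = env.observation_space.shape
        self.action_space = gym.spaces.Box(
            low=-epsilon * np.ones(shape, dtype=np.float32),
            high=epsilon * np.ones(shape, dtype=np.float32),
            dtype=np.float32,
        )
        self._state = None

    def reset(self):
        self._state = self.env.reset()
        return self._state  # the adversary observes the true state

    def step(self, delta):
        delta = np.clip(delta, -self.epsilon, self.epsilon)
        victim_action = self.victim.act(self._state + delta)  # victim sees the perturbed state
        self._state, reward, done, info = self.env.step(victim_action)
        # The adversary maximizes the negative of the victim's reward.
        return self._state, -reward, done, info
```

Running a standard PPO implementation on `AdversaryEnv` against a fixed victim then yields the learned adversary used for the attacks and inside the ATLA alternating-training procedure.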
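
The Experiment Setup row mentions a small-scale hyperparameter search over the adversary's policy/value learning rates and entropy bonus coefficient; the snippet below spells that structure out. Every numeric value is a placeholder chosen for illustration, not a value searched in the paper.

```python
# Hedged sketch of the ATLA-PPO hyperparameter setup described above. Only the
# structure follows the paper's description; all numbers are placeholders.
from itertools import product

agent_hparams = {
    # Kept identical to the vanilla MLP/LSTM agents, except that the entropy
    # bonus coefficient may be increased so the agent explores enough while
    # training against an adversary.
    "policy_lr": 3e-4,       # placeholder
    "entropy_coef": 1e-2,    # possibly larger than in the attack-free setting
}

# Small-scale grid over the adversary's policy/value learning rates and
# entropy bonus coefficient (assumed grids, not taken from the paper).
adv_policy_lrs = [1e-4, 3e-4, 1e-3]
adv_value_lrs = [1e-4, 3e-4, 1e-3]
adv_entropy_coefs = [0.0, 1e-3, 1e-2]

adversary_configs = [
    {"adv_policy_lr": plr, "adv_value_lr": vlr, "adv_entropy_coef": ent}
    for plr, vlr, ent in product(adv_policy_lrs, adv_value_lrs, adv_entropy_coefs)
]
```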