Robust Reinforcement Learning on State Observations with Learned Optimal Adversary
Authors: Huan Zhang, Hongge Chen, Duane S Boning, Cho-Jui Hsieh
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on a few continuous control environments show that ATLA achieves state-of-the-art performance under strong adversaries. |
| Researcher Affiliation | Academia | Department of Computer Science, UCLA; Department of EECS, MIT |
| Pseudocode | Yes | Algorithm 1: Learning an optimal adversary for perturbations on state observations (a minimal code sketch of this procedure is given after the table). |
| Open Source Code | Yes | Our code is available at https://github.com/huanzhang12/ATLA_robust_RL. |
| Open Datasets | Yes | In this section, we use PPO to train an adversary on four OpenAI Gym MuJoCo continuous control environments. |
| Dataset Splits | No | No explicit training/validation/test dataset splits were provided, as is common in continuous control reinforcement learning, where data is generated through environment interaction rather than drawn from a static dataset. The paper describes training agents for a fixed number of steps, evaluating performance over 50 episodes, and selecting agents based on median robustness from 21 runs. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'OpenAI Gym MuJoCo continuous control environments' and 'PPO as our policy optimizer' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Hyperparameters for ATLA-PPO: For ATLA-PPO, we have hyperparameters for both agent and adversary. We keep all agent hyperparameters the same as those in vanilla MLP/LSTM agents, except for the entropy bonus coefficient. We find that sometimes we need a larger entropy bonus coefficient in ATLA to allow sufficient exploration of the agent, as learning with an adversary is harder than learning in attack-free environments. For the adversary, we run a small-scale hyperparameter search on the learning rate of adversary policy and value networks, and the entropy bonus coefficient for the adversary. (A hedged sketch of this hyperparameter structure is given after the table.) |
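
As a concrete reading of the Algorithm 1 referenced in the Pseudocode row, the sketch below wraps a Gym environment so that the adversary becomes the learning agent: it observes the true state, emits an ℓ∞-bounded perturbation, the fixed victim policy acts on the perturbed observation, and the adversary receives the negated reward. This is a minimal illustration only; the wrapper class, the `victim_policy.act(obs)` interface, and the use of the classic (pre-0.26) Gym API are assumptions, not the authors' implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of learning an optimal adversary on state observations
# (Algorithm 1), assuming the classic (pre-0.26) Gym API and a fixed victim
# policy exposing `act(obs) -> action`. All names here are illustrative.
import gym
import numpy as np


class AdversaryEnv(gym.Env):
    """Recasts the environment so that the adversary is the RL agent being trained."""

    def __init__(self, env, victim_policy, epsilon):
        self.env = env
        self.victim = victim_policy
        self.epsilon = epsilon
        self.observation_space = env.observation_space
        # Adversary action: a perturbation inside the l_inf ball of radius epsilon.
        shape = env.observation_space.shape
        self.action_space = gym.spaces.Box(
            low=-epsilon * np.ones(shape, dtype=np.float32),
            high=epsilon * np.ones(shape, dtype=np.float32),
            dtype=np.float32,
        )
        self._state = None

    def reset(self):
        self._state = self.env.reset()
        return self._state  # the adversary observes the true state

    def step(self, delta):
        delta = np.clip(delta, -self.epsilon, self.epsilon)
        victim_action = self.victim.act(self._state + delta)  # victim sees the perturbed state
        self._state, reward, done, info = self.env.step(victim_action)
        # The adversary maximizes the negative of the victim's reward.
        return self._state, -reward, done, info
```

Running a standard PPO implementation on `AdversaryEnv` against a fixed victim then yields the learned adversary used for the attacks and inside the ATLA alternating-training procedure.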
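
The Experiment Setup row mentions a small-scale hyperparameter search over the adversary's policy/value learning rates and entropy bonus coefficient; the snippet below spells that structure out. Every numeric value is a placeholder chosen for illustration, not a value searched in the paper.

```python
# Hedged sketch of the ATLA-PPO hyperparameter setup described above. Only the
# structure follows the paper's description; all numbers are placeholders.
from itertools import product

agent_hparams = {
    # Kept identical to the vanilla MLP/LSTM agents, except that the entropy
    # bonus coefficient may be increased so the agent explores enough while
    # training against an adversary.
    "policy_lr": 3e-4,       # placeholder
    "entropy_coef": 1e-2,    # possibly larger than in the attack-free setting
}

# Small-scale grid over the adversary's policy/value learning rates and
# entropy bonus coefficient (assumed grids, not taken from the paper).
adv_policy_lrs = [1e-4, 3e-4, 1e-3]
adv_value_lrs = [1e-4, 3e-4, 1e-3]
adv_entropy_coefs = [0.0, 1e-3, 1e-2]

adversary_configs = [
    {"adv_policy_lr": plr, "adv_value_lr": vlr, "adv_entropy_coef": ent}
    for plr, vlr, ent in product(adv_policy_lrs, adv_value_lrs, adv_entropy_coefs)
]
```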