Online Robust Policy Learning in the Presence of Unknown Adversaries

Authors: Aaron Havens, Zhanhong Jiang, Soumik Sarkar

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that the proposed algorithm enables policy learning with significantly lower bias as compared to the state-of-the-art policy learning approaches even in the presence of heavy state information attacks. We present algorithm analysis and simulation results using popular Open AI Gym environments.
Researcher Affiliation | Academia | Aaron J. Havens, Zhanhong Jiang, Soumik Sarkar, Department of Mechanical Engineering, Iowa State University, Ames, IA 50011, {ajhavens,zhjiang,soumiks}@iastate.edu
Pseudocode | Yes | Algorithm 1: MLAH
Input: πnom and πadv sub-policies parameterized by θnom and θadv; master policy πmaster with parameter vector φ.
1. Initialize θnom, θadv, φ
2. for pre-training iterations [optional] do
3.     Train πnom and θnom on only nominal experiences.
4. end
5. for learning life-time do
6.     for time steps t to t + T do
7.         Compute A_t over sub-policies (see eq. 4)
8.         πmaster selects to switch or stay with a sub-policy based on A_t observations to take an action
9.     end
10.    Estimate all A^GAE for πnom, πadv over T
11.    Estimate all A^GAE for πmaster over T with respect to A_t observations
12.    Optimize θnom based on experiences collected from πnom
13.    Optimize θadv based on experiences collected from πadv
14.    Optimize φ based on all experiences with respect to A_t observations
(A hedged Python sketch of this training loop follows the table.)
Open Source Code | Yes | The source code is available on https://github.com/AaronHavens/safe_rl.
Open Datasets | Yes | We present algorithm analysis and simulation results using popular Open AI Gym environments.
Dataset Splits | No | The paper mentions 'training' and 'evaluation' but does not provide specific details on dataset splits (e.g., percentages or counts for train/validation/test sets) or cross-validation setup beyond general usage of 'Open AI Gym environments'.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or processing power) used for running the experiments, only mentioning 'simulation results using popular Open AI Gym environments'.
Software Dependencies | No | The paper mentions using PPO [17] and Open AI Gym environments [14], but it does not specify version numbers for any software dependencies like Python, PyTorch/TensorFlow, or specific library versions, which are necessary for full reproducibility.
Experiment Setup | No | The paper states, 'For page limit constraints, PPO parameters used in experiments such as deep network size and actor-batches can be found in the supplementary material,' meaning that specific hyperparameters are not provided in the main text.
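
To make the pseudocode row concrete, below is a minimal structural sketch of the Algorithm 1 (MLAH) loop, not the authors' released implementation. It assumes the classic OpenAI Gym step/reset API and uses CartPole-v1 purely for illustration; SubPolicy, MasterPolicy, and gae_and_ppo_update are hypothetical stand-ins for the PPO actors, the advantage signal A_t (eq. 4), and the GAE/PPO updates described in the paper.

```python
# Minimal structural sketch of Algorithm 1 (MLAH), assuming the classic
# OpenAI Gym API (4-tuple step return) and CartPole-v1 for illustration.
# SubPolicy, MasterPolicy, and gae_and_ppo_update are placeholders, not the
# authors' implementation: in the paper each sub-policy and the master are
# PPO actors optimized with GAE advantage estimates.
import gym
import numpy as np


class SubPolicy:
    """Stand-in for a PPO sub-policy (pi_nom / pi_adv, parameters theta)."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, obs):
        return self.action_space.sample()    # placeholder for the PPO actor

    def advantage(self, obs):
        return float(np.random.randn())      # placeholder for the critic's estimate


class MasterPolicy:
    """Stand-in for pi_master (parameters phi): picks a sub-policy from A_t."""

    def select(self, a_t):
        return 0 if a_t >= 0.0 else 1        # placeholder switching rule


def gae_and_ppo_update(name, rollout):
    """Placeholder for steps 10-14 of Algorithm 1: estimate A^GAE, run PPO."""
    print(f"update {name}: {len(rollout)} transitions")


env = gym.make("CartPole-v1")
pi_nom, pi_adv = SubPolicy(env.action_space), SubPolicy(env.action_space)
pi_master = MasterPolicy()

T = 256                                      # steps per update window
obs = env.reset()

for lifetime_iter in range(3):               # "learning life-time" outer loop
    rollouts = {"nom": [], "adv": [], "master": []}
    for t in range(T):
        # A_t over sub-policies (eq. 4 in the paper); here a crude random proxy.
        a_t = pi_nom.advantage(obs) - pi_adv.advantage(obs)
        choice = pi_master.select(a_t)       # switch or stay with a sub-policy
        policy = pi_nom if choice == 0 else pi_adv
        action = policy.act(obs)
        next_obs, reward, done, _ = env.step(action)
        rollouts["nom" if choice == 0 else "adv"].append((obs, action, reward))
        rollouts["master"].append((a_t, choice, reward))
        obs = env.reset() if done else next_obs

    # Steps 10-14: GAE estimates and separate PPO updates for theta_nom,
    # theta_adv, and phi (the last with respect to the A_t observations).
    for name, rollout in rollouts.items():
        gae_and_ppo_update(name, rollout)
```

The random advantage proxy and the fixed switching rule mark the two places where the paper's learned components would plug in: PPO critics supply the advantage observations, and the master policy is itself trained (with respect to those A_t observations) to decide when to hand control to the adversarial sub-policy.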