Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach

Authors: James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras

AAAI 2021, pp. 9377-9385 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the robust performance of our approach through experiments on high-dimensional continuous control tasks in OpenAI Gym's MuJoCo environments (Brockman et al. 2016; Todorov, Erez, and Tassa 2012).
Researcher Affiliation | Academia | James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras, Division of Systems Engineering, Boston University, Boston, MA 02215, {jqueeney, yannisp, cgc}@bu.edu
Pseudocode | Yes | Algorithm 1: Uncertainty-Aware Update Direction; Algorithm 2: Uncertainty-Aware Trust Region Policy Optimization (UA-TRPO)
Open Source Code | No | The paper references third-party tools such as OpenAI Baselines and OpenAI Gym, but it does not provide an explicit link or statement about the availability of its own source code for the described methodology.
Open Datasets | Yes | We focus on the locomotion tasks in the OpenAI Baselines (Dhariwal et al. 2017) MuJoCo1M benchmark set (Swimmer-v3, Hopper-v3, HalfCheetah-v3, and Walker2d-v3), all of which have continuous, high-dimensional state and action spaces.
Dataset Splits | No | The paper describes how samples are collected for policy updates (e.g., '1,000 steps in our experiments'), but it does not specify traditional training, validation, and test dataset splits with percentages or counts for a static dataset.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions OpenAI Gym and MuJoCo environments as the experimental platforms but does not specify their version numbers or any other software dependencies with versions.
Experiment Setup | Yes | We use the hyperparameters from Henderson et al. (2018) for our implementation of TRPO, which includes δ_KL = 0.01 (see Equation (5)). For UA-TRPO, we use δ_UA = 0.03, c = 6e-4, and α = 0.05 for the inputs to Algorithm 2 in all of our experiments. We determined δ_UA through cross-validation, where the trade-off parameter c was chosen so that on average the KL divergence between consecutive policies is the same as in TRPO, to provide a fair comparison.
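
For convenience, the benchmark environments and hyperparameters quoted above can be gathered into a small reproduction sketch. This is an illustrative assumption, not the authors' released code (none is provided): the variable names, dictionary keys, and the use of gym.make are ours, while the numeric values come directly from the quoted rows.

import gym  # assumes a gym installation with MuJoCo support

# Locomotion tasks from the OpenAI Baselines MuJoCo1M benchmark set.
ENV_IDS = ["Swimmer-v3", "Hopper-v3", "HalfCheetah-v3", "Walker2d-v3"]

# Hyperparameters reported in the paper; the key names are illustrative.
TRPO_CONFIG = {
    "delta_kl": 0.01,   # KL trust region size for TRPO (Equation (5))
}
UA_TRPO_CONFIG = {
    "delta_ua": 0.03,   # uncertainty-aware trust region parameter
    "c": 6e-4,          # trade-off parameter, tuned to match TRPO's average KL step
    "alpha": 0.05,      # confidence level input to Algorithm 2
    "steps_per_update": 1000,  # samples collected per policy update
}

if __name__ == "__main__":
    # Sanity-check that the benchmark environments can be constructed.
    for env_id in ENV_IDS:
        env = gym.make(env_id)
        print(env_id, env.observation_space.shape, env.action_space.shape)
        env.close()

Running the check simply prints each environment's observation and action dimensions, confirming the continuous, high-dimensional spaces noted in the Open Datasets row.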