Diverse Exploration via Conjugate Policies for Policy Gradient Methods

Authors: Andrew Cohen, Xingye Qiao, Lei Yu, Elliot Way, Xiangrong Tong

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results based on Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) on three continuous control domains show that TRPO with DE significantly outperforms the baseline TRPO as well as TRPO with random perturbations.
Researcher Affiliation | Academia | Andrew Cohen (Binghamton University, acohen13@binghamton.edu); Xingye Qiao (Binghamton University, qiao@math.binghamton.edu); Lei Yu (Binghamton University and Yantai University, lyu@cs.binghamton.edu); Elliot Way (Binghamton University, eway1@binghamton.edu); Xiangrong Tong (Yantai University, txr@ytu.edu.cn)
Pseudocode | Yes | Algorithm 1 DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p)
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We display results on three difficult continuous control tasks, Hopper, Walker and Half Cheetah, implemented in OpenAI Gym (Brockman et al. 2016) and using the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012).
Dataset Splits | No | The paper describes sample collection for policy improvement iterations but does not provide specific dataset split information (e.g., percentages or counts for train/validation/test sets) for the environments used.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions OpenAI Gym, MuJoCo, and various neural network components but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | TRPO hyperparameters are taken from (Schulman et al. 2015; Duan et al. 2016). More specifically, we use k = 20 perturbations for Hopper and k = 40 perturbations for Walker and Half Cheetah for both DE and RP. For a total of N samples collected in each policy improvement iteration (N = 21000 for Hopper and N = 41000 for Walker and Half Cheetah in the reported results), TRPO collects β = N samples per iteration while DE and RP collect β = β_k = N/(k+1) samples from the main and each perturbed policy. The initial perturbation radius used in experiments is δ_p = 0.2 for Hopper and Half Cheetah and δ_p = 0.1 for Walker.
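
The Pseudocode row above reports Algorithm 1, DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p). As a reading aid only, the sketch below shows one plausible structure for a single policy-improvement iteration under that signature. It is not the authors' code: the helpers collect_samples, make_perturbed_policies, and trpo_update are hypothetical callables that a user would have to supply.

```python
# Illustrative sketch only: one policy-improvement iteration matching the
# reported signature DIVERSE EXPLORATION(pi_1, k, beta, beta_k, delta_p).
# The three helper callables are hypothetical placeholders, not the paper's API.

def diverse_exploration_iteration(pi, k, beta, beta_k, delta_p,
                                  collect_samples, make_perturbed_policies,
                                  trpo_update):
    """Sample from the main policy and k perturbed copies, then take one TRPO step."""
    # beta samples from the current main policy.
    trajectories = collect_samples(pi, beta)
    # k perturbed policies generated around pi with perturbation radius delta_p
    # (the paper derives the perturbation directions; details are omitted here).
    for perturbed_pi in make_perturbed_policies(pi, k, delta_p):
        # beta_k samples from each perturbed policy.
        trajectories += collect_samples(perturbed_pi, beta_k)
    # One TRPO improvement step on the pooled trajectories.
    return trpo_update(pi, trajectories)
```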
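The Open Datasets row cites the Hopper, Walker and Half Cheetah tasks from OpenAI Gym with the MuJoCo simulator. A minimal sketch of instantiating those environments follows; the "-v2" task IDs and a working gym/mujoco-py installation are assumptions about the reader's setup, not something the paper specifies.

```python
# Minimal sketch: instantiate the three MuJoCo control tasks in OpenAI Gym.
# The "-v2" suffixes depend on the installed gym version and are an assumption here.
import gym

for env_id in ("Hopper-v2", "Walker2d-v2", "HalfCheetah-v2"):
    env = gym.make(env_id)
    env.reset()
    print(env_id,
          "observation shape:", env.observation_space.shape,
          "action shape:", env.action_space.shape)
    env.close()
```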
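Finally, the Experiment Setup row reports that DE and RP split the per-iteration sample budget as β = β_k = N/(k+1) across the main policy and the k perturbed policies. The short check below works through that arithmetic for the reported settings; it is purely illustrative.

```python
# Worked example of the reported per-iteration sample allocation
# beta = beta_k = N / (k + 1), for the settings quoted above.
settings = {
    "Hopper": {"N": 21000, "k": 20},
    "Walker": {"N": 41000, "k": 40},
    "HalfCheetah": {"N": 41000, "k": 40},
}

for task, cfg in settings.items():
    per_policy = cfg["N"] / (cfg["k"] + 1)  # samples for the main policy and each perturbation
    print(f"{task}: {per_policy:.0f} samples per policy across {cfg['k'] + 1} policies")
```

With these values each policy, main or perturbed, collects 1000 samples per iteration, so the total per-iteration budget N is the same for TRPO, DE and RP.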