Diverse Exploration via Conjugate Policies for Policy Gradient Methods

Authors: Andrew Cohen, Xingye Qiao, Lei Yu, Elliot Way, Xiangrong Tong

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results based on Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) on three continuous control domains show that TRPO with DE significantly outperforms the baseline TRPO as well as TRPO with random perturbations.
Researcher Affiliation | Academia | Andrew Cohen (Binghamton University, acohen13@binghamton.edu); Xingye Qiao (Binghamton University, qiao@math.binghamton.edu); Lei Yu (Binghamton University and Yantai University, lyu@cs.binghamton.edu); Elliot Way (Binghamton University, eway1@binghamton.edu); Xiangrong Tong (Yantai University, txr@ytu.edu.cn)
Pseudocode | Yes | Algorithm 1 DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p)
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We display results on three difficult continuous control tasks, Hopper, Walker and Half Cheetah, implemented in OpenAI Gym (Brockman et al. 2016) and using the MuJoCo physics simulator (Todorov, Erez, and Tassa 2012).
Dataset Splits | No | The paper describes sample collection for policy improvement iterations but does not provide specific dataset split information (e.g., percentages or counts for train/validation/test sets) for the environments used.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions OpenAI Gym, MuJoCo, and various neural network components but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | TRPO hyperparameters are taken from (Schulman et al. 2015; Duan et al. 2016). More specifically, we use k = 20 perturbations for Hopper and k = 40 perturbations for Walker and Half Cheetah for both DE and RP. For a total of N samples collected in each policy improvement iteration (N = 21000 for Hopper and N = 41000 for Walker and Half Cheetah in the reported results), TRPO collects β = N samples per iteration while DE and RP collect β = β_k = N/(k+1) samples from the main and each perturbed policy. The initial perturbation radius used in experiments is δ_p = 0.2 for Hopper and Half Cheetah and δ_p = 0.1 for Walker.
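
The Pseudocode row above reports Algorithm 1, DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p). As a reading aid only, the sketch below shows one plausible structure for a single policy-improvement iteration under that signature. It is not the authors' code: the helpers collect_samples, make_perturbed_policies, and trpo_update are hypothetical callables that a user would have to supply.

```python
# Illustrative sketch only: one policy-improvement iteration matching the
# reported signature DIVERSE EXPLORATION(pi_1, k, beta, beta_k, delta_p).
# The three helper callables are hypothetical placeholders, not the paper's API.

def diverse_exploration_iteration(pi, k, beta, beta_k, delta_p,
                                  collect_samples, make_perturbed_policies,
                                  trpo_update):
    """Sample from the main policy and k perturbed copies, then take one TRPO step."""
    # beta samples from the current main policy.
    trajectories = collect_samples(pi, beta)
    # k perturbed policies generated around pi with perturbation radius delta_p
    # (the paper derives the perturbation directions; details are omitted here).
    for perturbed_pi in make_perturbed_policies(pi, k, delta_p):
        # beta_k samples from each perturbed policy.
        trajectories += collect_samples(perturbed_pi, beta_k)
    # One TRPO improvement step on the pooled trajectories.
    return trpo_update(pi, trajectories)
```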
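The Open Datasets row cites the Hopper, Walker and Half Cheetah tasks from OpenAI Gym with the MuJoCo simulator. A minimal sketch of instantiating those environments follows; the "-v2" task IDs and a working gym/mujoco-py installation are assumptions about the reader's setup, not something the paper specifies.

```python
# Minimal sketch: instantiate the three MuJoCo control tasks in OpenAI Gym.
# The "-v2" suffixes depend on the installed gym version and are an assumption here.
import gym

for env_id in ("Hopper-v2", "Walker2d-v2", "HalfCheetah-v2"):
    env = gym.make(env_id)
    env.reset()
    print(env_id,
          "observation shape:", env.observation_space.shape,
          "action shape:", env.action_space.shape)
    env.close()
```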
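Finally, the Experiment Setup row reports that DE and RP split the per-iteration sample budget as β = β_k = N/(k+1) across the main policy and the k perturbed policies. The short check below works through that arithmetic for the reported settings; it is purely illustrative.

```python
# Worked example of the reported per-iteration sample allocation
# beta = beta_k = N / (k + 1), for the settings quoted above.
settings = {
    "Hopper": {"N": 21000, "k": 20},
    "Walker": {"N": 41000, "k": 40},
    "HalfCheetah": {"N": 41000, "k": 40},
}

for task, cfg in settings.items():
    per_policy = cfg["N"] / (cfg["k"] + 1)  # samples for the main policy and each perturbation
    print(f"{task}: {per_policy:.0f} samples per policy across {cfg['k'] + 1} policies")
```

With these values each policy, main or perturbed, collects 1000 samples per iteration, so the total per-iteration budget N is the same for TRPO, DE and RP.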