Diverse Exploration via Conjugate Policies for Policy Gradient Methods
Authors: Andrew Cohen, Xingye Qiao, Lei Yu, Elliot Way, Xiangrong Tong
AAAI 2019, pp. 3404-3411
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results based on Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) on three continuous control domains show that TRPO with DE significantly outperforms the baseline TRPO as well as TRPO with random perturbations. |
| Researcher Affiliation | Academia | Andrew Cohen, Binghamton University, acohen13@binghamton.edu; Xingye Qiao, Binghamton University, qiao@math.binghamton.edu; Lei Yu, Binghamton University and Yantai University, lyu@cs.binghamton.edu; Elliot Way, Binghamton University, eway1@binghamton.edu; Xiangrong Tong, Yantai University, txr@ytu.edu.cn |
| Pseudocode | Yes | Algorithm 1 DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p) |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We display results on three difficult continuous control tasks, Hopper, Walker and Half Cheetah implemented in Open AI gym (Brockman et al. 2016) and using the Mujoco physics simulator (Todorov, Erez, and Tassa 2012). |
| Dataset Splits | No | The paper describes sample collection for policy improvement iterations but does not provide specific dataset split information (e.g., percentages or counts for train/validation/test sets) for the environments used. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions Open AI gym, Mujoco, and various neural network components but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | TRPO hyperparameters are taken from (Schulman et al. 2015; Duan et al. 2016). More specifically, we use k = 20 perturbations for Hopper and k = 40 perturbations for Walker and Half Cheetah for both DE and RP. For a total of N (N = 21000 for Hopper and N = 41000 for Walker and Half Cheetah in the reported results) samples collected in each policy improvement iteration, TRPO collects β = N samples per iteration while DE and RP collect β = β_k = N/(k+1) samples from the main and each perturbed policy. The initial perturbation radius used in experiments is δ_p = .2 for Hopper and Half Cheetah and δ_p = .1 for Walker. |
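
To make the per-policy sample budget in the Experiment Setup row concrete, the following is a minimal worked check of β_k = N/(k+1) under the reported settings; the dictionary and variable names are illustrative and not taken from the authors' code.

```python
# Worked check of the per-policy sample budget beta_k = N / (k + 1)
# for the settings reported in the Experiment Setup row above.
settings = {
    "Hopper":       {"N": 21000, "k": 20},
    "Walker":       {"N": 41000, "k": 40},
    "Half Cheetah": {"N": 41000, "k": 40},
}

for env, s in settings.items():
    # samples collected from the main policy and from each perturbed policy
    beta_k = s["N"] // (s["k"] + 1)
    print(f"{env}: beta = beta_k = {beta_k} samples per policy per iteration")

# Every domain works out to 1000 samples per policy:
# 21000 / 21 for Hopper; 41000 / 41 for Walker and Half Cheetah.
```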
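
The Pseudocode row references Algorithm 1, DIVERSE EXPLORATION(π_1, k, β, β_k, δ_p). Below is a minimal Python sketch of the sample-collection structure that signature implies: each iteration draws β_k samples from the main policy and from each of k perturbed policies, then updates the main policy with TRPO. The helpers `sample_trajectories`, `perturb`, and `trpo_update` are hypothetical stand-ins; in particular, `perturb` abstracts the paper's conjugate-policy construction, which is not reproduced here.

```python
def diverse_exploration(pi_1, k, beta_k, delta_p, n_iterations,
                        sample_trajectories, perturb, trpo_update):
    """Sketch of the DE outer loop (hypothetical helpers injected as arguments).

    Each policy-improvement iteration pools beta_k samples from the main policy
    and from each of k policies perturbed within radius delta_p, then takes one
    TRPO step on the main policy using the pooled batch.
    """
    pi = pi_1
    for _ in range(n_iterations):
        batch = list(sample_trajectories(pi, n_samples=beta_k))       # main policy
        for _ in range(k):
            pi_i = perturb(pi, radius=delta_p)                        # perturbed policy
            batch.extend(sample_trajectories(pi_i, n_samples=beta_k))
        pi = trpo_update(pi, batch)                                   # standard TRPO step
    return pi
```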