Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Diverse Exploration via Conjugate Policies for Policy Gradient Methods
Authors: Andrew Cohen, Xingye Qiao, Lei Yu, Elliot Way, Xiangrong Tong (pp. 3404-3411)
AAAI 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results based on Trust Region Policy Optimization (TRPO) (Schulman et al. 2015) on three continuous control domains show that TRPO with DE significantly outperforms the baseline TRPO as well as TRPO with random perturbations. |
| Researcher Affiliation | Academia | Andrew Cohen (Binghamton University); Xingye Qiao (Binghamton University); Lei Yu (Binghamton University; Yantai University); Elliot Way (Binghamton University); Xiangrong Tong (Yantai University) |
| Pseudocode | Yes | Algorithm 1 DIVERSE EXPLORATION(π1, k, β, βk, δp) |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We display results on three difficult continuous control tasks, Hopper, Walker and Half Cheetah implemented in Open AI gym (Brockman et al. 2016) and using the Mujoco physics simulator (Todorov, Erez, and Tassa 2012). |
| Dataset Splits | No | The paper describes sample collection for policy improvement iterations but does not provide specific dataset split information (e.g., percentages or counts for train/validation/test sets) for the environments used. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions Open AI gym, Mujoco, and various neural network components but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | TRPO hyperparameters are taken from (Schulman et al. 2015; Duan et al. 2016). More specifically, we use k = 20 perturbations for Hopper and k = 40 perturbations for Walker and Half Cheetah for both DE and RP. For a total of N (N = 21000 for Hopper and N = 41000 for Walker and Half Cheetah in the reported results) samples collected in each policy improvement iteration, TRPO collects β = N samples per iteration while DE and RP collect β = βk = N/(k+1) samples from the main and each perturbed policy. The initial perturbation radius used in experiments is δp = 0.2 for Hopper and Half Cheetah and δp = 0.1 for Walker. |
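The per-policy sample budget in the Experiment Setup row follows directly from splitting the total budget N evenly across the main policy and the k perturbed policies, i.e. β = βk = N/(k+1). A minimal sketch of that arithmetic for the reported configurations (function and variable names here are illustrative, not from the paper):

```python
# Per-policy sample budget for DE/RP: beta = beta_k = N / (k + 1),
# splitting the total of N samples per iteration across the main policy
# and the k perturbed policies. Configurations are those reported in the paper.
configs = {
    "Hopper": {"N": 21000, "k": 20},
    "Walker": {"N": 41000, "k": 40},
    "HalfCheetah": {"N": 41000, "k": 40},
}

def per_policy_budget(N: int, k: int) -> int:
    """Samples collected from the main policy and from each perturbed policy."""
    return N // (k + 1)

for env, cfg in configs.items():
    print(env, per_policy_budget(cfg["N"], cfg["k"]))
```

Note that both settings work out to 1000 samples per policy per iteration, so the total sample budget matches plain TRPO's β = N.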