Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning to Explore via Meta-Policy Gradient
Authors: Tianbing Xu, Qiang Liu, Liang Zhao, Jian Peng
ICML 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks. |
| Researcher Affiliation | Collaboration | (1) Baidu Research, Sunnyvale, CA; (2) University of Texas at Austin, TX; (3) University of Illinois at Urbana-Champaign, IL. |
| Pseudocode | Yes | Algorithm 1 Teacher: Learn to Explore |
| Open Source Code | No | Our implementation is based on OpenAI's DDPG baseline (Plappert et al., 2017) on GitHub (https://github.com/openai/baselines/tree/master/baselines/ddpg). This links to a third-party baseline, not the authors' own source code for their specific methodology. |
| Open Datasets | Yes | We have performed extensive experiments on several classic control and Mujoco (Todorov et al., 2012) tasks, including Hopper, Reacher, Half-Cheetah, Inverted Pendulum, Inverted Double Pendulum and Pendulum. |
| Dataset Splits | No | The paper refers to "evaluation steps" for D1 to evaluate student performance, but this is not explicitly framed as a distinct "validation set" for hyperparameter tuning or early stopping purposes, nor are typical validation split percentages provided. |
| Hardware Specification | Yes | Our experiments were performed on a server with 8 Tesla-M40-24GB GPU and 40 Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz processors. |
| Software Dependencies | No | The paper mentions specific optimizers (Adam) and normalization techniques (Layer-Normalization), but does not provide version numbers for these or for the core software frameworks (e.g., TensorFlow, PyTorch) used for implementation. |
| Experiment Setup | Yes | The parameter settings are: exploration rollout steps (typically 100) for generating exploration trajectories D0, number of evaluation steps (typically 200, same as DDPG's rollout steps) for generating exploitation trajectories D1 used to evaluate student's performance, number of training steps (typically 50, aligning with DDPG's training steps) to update student policy π, and number of exploration training steps (typically 1) to update the Meta policy πe. |
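
The Experiment Setup row lists four step counts per iteration. As a rough illustration only, they could be collected into a configuration object like the one below. This is a minimal sketch, not the authors' code: the class, field, and function names are hypothetical, the phase ordering simply follows the order in which the row lists the settings, and counting both D0 and D1 as environment interaction is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaExplorationConfig:
    """Hypothetical container for the 'typical' values quoted in the table."""

    exploration_rollout_steps: int = 100   # steps to generate exploration trajectories D0
    evaluation_steps: int = 200            # steps to generate exploitation trajectories D1
    student_training_steps: int = 50       # updates to the student policy pi
    exploration_training_steps: int = 1    # updates to the meta policy pi_e


def phases_per_iteration(cfg: MetaExplorationConfig) -> list[tuple[str, int]]:
    """List (phase name, step count) pairs in the order the table row gives them.

    The ordering is an assumption for illustration, not a statement of
    how the paper's Algorithm 1 sequences these phases.
    """
    return [
        ("explore_with_meta_policy", cfg.exploration_rollout_steps),
        ("evaluate_student", cfg.evaluation_steps),
        ("train_student", cfg.student_training_steps),
        ("train_meta_policy", cfg.exploration_training_steps),
    ]


def env_steps_per_iteration(cfg: MetaExplorationConfig) -> int:
    """Environment steps per iteration, assuming D0 and D1 both come from rollouts."""
    return cfg.exploration_rollout_steps + cfg.evaluation_steps


if __name__ == "__main__":
    cfg = MetaExplorationConfig()
    for name, steps in phases_per_iteration(cfg):
        print(f"{name}: {steps}")
    print("environment steps per iteration:", env_steps_per_iteration(cfg))
```

Keeping the values in a frozen dataclass makes it easy to override a single setting per task (for example `MetaExplorationConfig(evaluation_steps=100)`) while leaving the quoted defaults untouched.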