Learning to Explore via Meta-Policy Gradient

Authors: Tianbing Xu, Qiang Liu, Liang Zhao, Jian Peng

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks."
Researcher Affiliation | Collaboration | Baidu Research, Sunnyvale, CA; University of Texas at Austin, TX; University of Illinois at Urbana-Champaign, IL.
Pseudocode | Yes | "Algorithm 1 Teacher: Learn to Explore"
Open Source Code | No | "Our implementation is based on the OpenAI DDPG baseline (Plappert et al., 2017)" on GitHub: https://github.com/openai/baselines/tree/master/baselines/ddpg. This links to a third-party baseline, not the authors' own source code for their specific methodology.
Open Datasets | Yes | "We have performed extensive experiments on several classic control and Mujoco (Todorov et al., 2012) tasks, including Hopper, Reacher, Half-Cheetah, Inverted Pendulum, Inverted Double Pendulum and Pendulum."
Dataset Splits | No | The paper refers to "evaluation steps" for D1 to evaluate student performance, but D1 is not framed as a distinct validation set for hyperparameter tuning or early stopping, and no validation split percentages are provided.
Hardware Specification | Yes | "Our experiments were performed on a server with 8 Tesla-M40-24GB GPU and 40 Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz processors."
Software Dependencies | No | The paper mentions specific optimizers (Adam) and normalization techniques (Layer Normalization), but does not provide version numbers for these or for the core software frameworks (e.g., TensorFlow, PyTorch) used for implementation.
Experiment Setup | Yes | "The parameter settings are: exploration rollout steps (typically 100) for generating exploration trajectories D0, number of evaluation steps (typically 200, same as DDPG's rollout steps) for generating exploitation trajectories D1 used to evaluate the student's performance, number of training steps (typically 50, aligning with DDPG's training steps) to update the student policy π, and number of exploration training steps (typically 1) to update the meta policy πe."
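
To make the loop structure behind the experiment-setup row concrete, here is a minimal Python sketch of one teacher-student iteration, not the authors' implementation. It assumes the classic Gym-style reset/step API, and all object interfaces (teacher.explore_action, student.train_step, teacher.meta_policy_gradient_step, etc.) are hypothetical placeholders rather than the OpenAI baselines API; only the step counts (100 exploration steps, 200 evaluation steps, 50 student training steps, 1 meta-policy update) are taken from the paper. Using the student's before/after return difference as the meta-reward is a simplification of the paper's reward-improvement signal.

```python
# Structural sketch of the teacher (meta exploration policy) / student (DDPG)
# training loop described above. All interfaces are hypothetical placeholders.

def run_episode(env, policy, max_steps):
    """Roll out `policy` in `env` for up to `max_steps`.

    Returns the collected transitions and the total reward.
    Assumes the classic Gym API: env.reset() -> obs,
    env.step(a) -> (obs, reward, done, info).
    """
    obs, transitions, total_reward = env.reset(), [], 0.0
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        total_reward += reward
        obs = next_obs
        if done:
            break
    return transitions, total_reward


def meta_exploration_iteration(env, student, teacher,
                               exploration_steps=100,   # rollout steps for D0
                               evaluation_steps=200,    # evaluation steps for D1
                               student_train_steps=50,  # DDPG training steps
                               teacher_train_steps=1):  # meta-policy updates
    """One alternating teacher/student iteration (hypothetical interfaces)."""
    # 1) The teacher's exploration policy pi_e generates exploration data D0.
    d0, _ = run_episode(env, teacher.explore_action, exploration_steps)

    # 2) Evaluate the student before training to obtain a baseline return.
    _, reward_before = run_episode(env, student.greedy_action, evaluation_steps)

    # 3) The student (DDPG) trains on the exploration data D0.
    student.replay_buffer.extend(d0)
    for _ in range(student_train_steps):
        student.train_step()

    # 4) Evaluate the trained student; the exploitation trajectories D1 and the
    #    return improvement serve as the meta-reward for the teacher.
    d1, reward_after = run_episode(env, student.greedy_action, evaluation_steps)
    meta_reward = reward_after - reward_before

    # 5) Update the teacher's meta exploration policy pi_e by policy gradient,
    #    using the student's improvement as the reward signal.
    for _ in range(teacher_train_steps):
        teacher.meta_policy_gradient_step(d0, meta_reward)

    return meta_reward
```

The sketch only fixes the ordering and step budgets reported in the table; the concrete DDPG update rule, replay-buffer handling, and meta-policy parameterization would come from the paper's Algorithm 1 and the OpenAI baselines code it builds on.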