Learning to Explore via Meta-Policy Gradient

Authors: Tianbing Xu, Qiang Liu, Liang Zhao, Jian Peng

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks."
Researcher Affiliation | Collaboration | Baidu Research, Sunnyvale, CA; University of Texas at Austin, TX; University of Illinois at Urbana-Champaign, IL.
Pseudocode | Yes | "Algorithm 1 Teacher: Learn to Explore"
Open Source Code | No | "Our implementation is based on the OpenAI DDPG baseline (Plappert et al., 2017)" on GitHub: https://github.com/openai/baselines/tree/master/baselines/ddpg. This links to a third-party baseline, not the authors' own source code for their specific methodology.
Open Datasets | Yes | "We have performed extensive experiments on several classic control and Mujoco (Todorov et al., 2012) tasks, including Hopper, Reacher, Half-Cheetah, Inverted Pendulum, Inverted Double Pendulum and Pendulum."
Dataset Splits | No | The paper refers to "evaluation steps" for D1 to evaluate student performance, but D1 is not framed as a distinct validation set for hyperparameter tuning or early stopping, and no validation split percentages are provided.
Hardware Specification | Yes | "Our experiments were performed on a server with 8 Tesla-M40-24GB GPU and 40 Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz processors."
Software Dependencies | No | The paper mentions specific optimizers (Adam) and normalization techniques (Layer Normalization), but does not provide version numbers for these or for the core software frameworks (e.g., TensorFlow, PyTorch) used for implementation.
Experiment Setup | Yes | "The parameter settings are: exploration rollout steps (typically 100) for generating exploration trajectories D0, number of evaluation steps (typically 200, same as DDPG's rollout steps) for generating exploitation trajectories D1 used to evaluate the student's performance, number of training steps (typically 50, aligning with DDPG's training steps) to update the student policy π, and number of exploration training steps (typically 1) to update the meta policy πe."
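
To make the loop structure behind the experiment-setup row concrete, here is a minimal Python sketch of one teacher-student iteration, not the authors' implementation. It assumes the classic Gym-style reset/step API, and all object interfaces (teacher.explore_action, student.train_step, teacher.meta_policy_gradient_step, etc.) are hypothetical placeholders rather than the OpenAI baselines API; only the step counts (100 exploration steps, 200 evaluation steps, 50 student training steps, 1 meta-policy update) are taken from the paper. Using the student's before/after return difference as the meta-reward is a simplification of the paper's reward-improvement signal.

```python
# Structural sketch of the teacher (meta exploration policy) / student (DDPG)
# training loop described above. All interfaces are hypothetical placeholders.

def run_episode(env, policy, max_steps):
    """Roll out `policy` in `env` for up to `max_steps`.

    Returns the collected transitions and the total reward.
    Assumes the classic Gym API: env.reset() -> obs,
    env.step(a) -> (obs, reward, done, info).
    """
    obs, transitions, total_reward = env.reset(), [], 0.0
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        total_reward += reward
        obs = next_obs
        if done:
            break
    return transitions, total_reward


def meta_exploration_iteration(env, student, teacher,
                               exploration_steps=100,   # rollout steps for D0
                               evaluation_steps=200,    # evaluation steps for D1
                               student_train_steps=50,  # DDPG training steps
                               teacher_train_steps=1):  # meta-policy updates
    """One alternating teacher/student iteration (hypothetical interfaces)."""
    # 1) The teacher's exploration policy pi_e generates exploration data D0.
    d0, _ = run_episode(env, teacher.explore_action, exploration_steps)

    # 2) Evaluate the student before training to obtain a baseline return.
    _, reward_before = run_episode(env, student.greedy_action, evaluation_steps)

    # 3) The student (DDPG) trains on the exploration data D0.
    student.replay_buffer.extend(d0)
    for _ in range(student_train_steps):
        student.train_step()

    # 4) Evaluate the trained student; the exploitation trajectories D1 and the
    #    return improvement serve as the meta-reward for the teacher.
    d1, reward_after = run_episode(env, student.greedy_action, evaluation_steps)
    meta_reward = reward_after - reward_before

    # 5) Update the teacher's meta exploration policy pi_e by policy gradient,
    #    using the student's improvement as the reward signal.
    for _ in range(teacher_train_steps):
        teacher.meta_policy_gradient_step(d0, meta_reward)

    return meta_reward
```

The sketch only fixes the ordering and step budgets reported in the table; the concrete DDPG update rule, replay-buffer handling, and meta-policy parameterization would come from the paper's Algorithm 1 and the OpenAI baselines code it builds on.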