Learning to Explore via Meta-Policy Gradient
Authors: Tianbing Xu, Qiang Liu, Liang Zhao, Jian Peng
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | With an extensive study, we show that our method significantly improves the sample-efficiency of DDPG on a variety of reinforcement learning continuous control tasks. |
| Researcher Affiliation | Collaboration | (1) Baidu Research, Sunnyvale, CA; (2) University of Texas at Austin, TX; (3) University of Illinois at Urbana-Champaign, IL. |
| Pseudocode | Yes | Algorithm 1 Teacher: Learn to Explore |
| Open Source Code | No | Our implementation is based on OpenAI's DDPG baseline (Plappert et al., 2017) on GitHub: https://github.com/openai/baselines/tree/master/baselines/ddpg. This links to a third-party baseline, not the authors' own source code for their specific methodology. |
| Open Datasets | Yes | We have performed extensive experiments on several classic control and Mujoco (Todorov et al., 2012) tasks, including Hopper, Reacher, Half-Cheetah, Inverted Pendulum, Inverted Double Pendulum and Pendulum. |
| Dataset Splits | No | The paper refers to "evaluation steps" for D1 to evaluate the student's performance, but these are not framed as a distinct validation set for hyperparameter tuning or early stopping, and no train/validation split percentages are reported. |
| Hardware Specification | Yes | Our experiments were performed on a server with 8 Tesla-M40-24GB GPU and 40 Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz processors. |
| Software Dependencies | No | The paper mentions specific optimizers (Adam) and normalization techniques (Layer-Normalization), but does not provide version numbers for these or for the core software frameworks (e.g., TensorFlow, PyTorch) used for implementation. |
| Experiment Setup | Yes | The parameter settings are: exploration rollout steps (typically 100) for generating exploration trajectories D0; number of evaluation steps (typically 200, the same as DDPG's rollout steps) for generating exploitation trajectories D1, which are used to evaluate the student's performance; number of training steps (typically 50, aligning with DDPG's training steps) to update the student policy π; and number of exploration training steps (typically 1) to update the meta policy π_e. (A sketch of this loop follows the table.) |
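The setup row above effectively describes the outer loop of Algorithm 1 ("Teacher: Learn to Explore"). Below is a minimal Python sketch of that loop under stated assumptions: it uses the classic Gym `reset()`/`step()` 4-tuple interface, and the `student`/`teacher` objects and their methods (`student.act`, `student.store`, `student.train_step`, `teacher.exploration_policy`, `teacher.train_step`) are hypothetical placeholders standing in for the authors' DDPG student and meta exploration policy; none of these names come from the paper or the OpenAI baselines code.

```python
# Hypothetical sketch of the teacher-student exploration loop implied by the
# experiment-setup parameters above. Not the authors' implementation.

EXPLORATION_ROLLOUT_STEPS = 100   # steps used to generate exploration trajectories D0
EVALUATION_STEPS = 200            # steps used to generate exploitation trajectories D1
STUDENT_TRAIN_STEPS = 50          # DDPG updates of the student policy pi
TEACHER_TRAIN_STEPS = 1           # meta policy-gradient updates of the meta policy pi_e


def collect(env, policy, num_steps):
    """Roll out `policy` for `num_steps` in a classic-Gym-style env.

    Returns the list of transitions and the total reward accumulated.
    """
    transitions, total_reward = [], 0.0
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        total_reward += reward
        obs = env.reset() if done else next_obs
    return transitions, total_reward


def meta_exploration_loop(env, student, teacher, num_iterations):
    """One possible structure for Algorithm 1 (Teacher: Learn to Explore)."""
    for _ in range(num_iterations):
        # 1) The teacher's exploration policy gathers exploration trajectories D0.
        d0, _ = collect(env, teacher.exploration_policy(student),
                        EXPLORATION_ROLLOUT_STEPS)
        student.store(d0)

        # 2) Evaluate the student before training (exploitation trajectories D1).
        _, reward_before = collect(env, student.act, EVALUATION_STEPS)

        # 3) Train the student (DDPG) on its replay buffer.
        for _ in range(STUDENT_TRAIN_STEPS):
            student.train_step()

        # 4) Evaluate again; the improvement serves as the teacher's meta-reward.
        _, reward_after = collect(env, student.act, EVALUATION_STEPS)
        meta_reward = reward_after - reward_before

        # 5) Update the teacher's exploration policy with a policy-gradient step.
        for _ in range(TEACHER_TRAIN_STEPS):
            teacher.train_step(d0, meta_reward)
```

One natural reading of the table's "evaluation steps" for D1 is shown in steps 2 and 4: the before/after difference in the student's evaluation return acts as the meta-reward that credits the exploration behaviour; the exact form of the meta-reward in the paper may differ. Environments would be created with Gym/MuJoCo (e.g., `gym.make("HalfCheetah-v2")`; the env ID suffix depends on the installed Gym version and is assumed here).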