Curriculum-guided Hindsight Experience Replay
Authors: Meng Fang, Tianyi Zhou, Yali Du, Lei Han, Zhengyou Zhang
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CHER and compare to state-of-the-art baselines on several challenging robotic manipulation tasks in simulated MuJoCo environments [Todorov et al., 2012]. In particular, we will use a simple Fetch environment as a toy example and Shadow Dexterous Hand environments from OpenAI Gym [Brockman et al., 2016]. |
| Researcher Affiliation | Collaboration | (1) Tencent Robotics X; (2) Paul G. Allen School of Computer Science & Engineering, University of Washington; (3) University College London |
| Pseudocode | Yes | Algorithm 1 STOCHASTIC-GREEDY(k, m, λ) and Algorithm 2 Curriculum-guided HER (CHER) are provided in the paper. |
| Open Source Code | Yes | Our code is available at https://github.com/mengf1/CHER. |
| Open Datasets | Yes | We evaluate CHER and compare to state-of-the-art baselines on several challenging robotic manipulation tasks in simulated MuJoCo environments [Todorov et al., 2012]. In particular, we will use a simple Fetch environment as a toy example and Shadow Dexterous Hand environments from OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits with specific percentages or counts. It mentions training for 50 epochs and evaluating policies after each epoch with test rollouts, but no distinct validation split is specified. |
| Hardware Specification | No | The paper mentions: 'For all environments except FetchReach, we train policies on a single machine with 20 CPU cores. Each core generates experiences by using two parallel rollouts with MPI for synchronization.' It does not specify the CPU model or any GPU information. |
| Software Dependencies | No | The paper mentions using 'DDPG', 'HER', 'OpenAI Gym', and 'MuJoCo environments' but does not provide specific version numbers for any of these software components. |
| Experiment Setup | Yes | For all environments except FetchReach, we train policies on a single machine with 20 CPU cores. Each core generates experiences using two parallel rollouts with MPI for synchronization. We train each agent for 50 epochs with batch size 64. Hyperparameters are nearly the same as in Andrychowicz et al. [2017]. In CHER, we use |B| = 128, |A| = k = 64 and |b| = m = 3 for Algorithm 1. For the tasks in this paper, we use an exponentially increasing λ over the course of training, i.e., λ = (1 + η)^γ · λ0 (Eq. 7), where η ∈ [0, 1] is a learning pace controlling the progress of the curriculum, γ is the episode index of the off-policy RL, and λ0 is the initial weight for proximity, which should be relatively small. (Hedged sketches of this λ schedule and of a stochastic-greedy selection step follow the table.) |
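The λ schedule in Eq. (7) above is simple enough to state as code. A minimal sketch, assuming illustrative names (`curriculum_lambda`, `lambda0`, `eta`, `episode`) that are not taken from the released CHER repository:

```python
# Sketch of the exponentially increasing proximity weight from Eq. (7):
# lambda = (1 + eta)^episode * lambda0. Names here are illustrative only.

def curriculum_lambda(lambda0: float, eta: float, episode: int) -> float:
    """Return the curriculum weight for the given off-policy RL episode."""
    assert 0.0 <= eta <= 1.0, "eta is a learning pace in [0, 1]"
    return (1.0 + eta) ** episode * lambda0

# A small initial weight grows exponentially as training progresses.
for ep in (0, 10, 50):
    print(ep, curriculum_lambda(lambda0=1e-3, eta=0.05, episode=ep))
```

Because λ weights the proximity term, a small λ0 lets early training favor diverse goals and later training favor goals close to the desired ones.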
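For the STOCHASTIC-GREEDY(k, m, λ) routine named in the Pseudocode row, the following is a hedged, generic sketch of sampled greedy subset selection under the reported sizes (|B| = 128 candidates, k = 64 selected, |b| = m = 3 sampled per step); the `score` oracle is a placeholder and does not reproduce the paper's proximity-plus-diversity objective.

```python
# Generic stochastic-greedy subset selection: at each step, sample m of the
# remaining candidates and add the one with the largest marginal gain.
# `score` is a stand-in set-function oracle, not the CHER objective.
import random
from typing import Callable, List, Set

def stochastic_greedy(candidates: List[int],
                      score: Callable[[Set[int]], float],
                      k: int = 64,
                      m: int = 3) -> Set[int]:
    selected: Set[int] = set()
    remaining = set(candidates)
    while len(selected) < k and remaining:
        sample = random.sample(sorted(remaining), min(m, len(remaining)))
        best = max(sample, key=lambda e: score(selected | {e}) - score(selected))
        selected.add(best)
        remaining.remove(best)
    return selected

# Example with the paper's reported sizes: pick 64 of 128 candidate transitions.
chosen = stochastic_greedy(list(range(128)), score=lambda s: float(len(s)), k=64, m=3)
```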