Gradient Information Matters in Policy Optimization by Back-propagating through Model
Authors: Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically demonstrate the proposed algorithm has better sample efficiency when achieving a comparable or better performance on benchmark continuous control tasks. Codes are available at https://github.com/CCreal/ddppo |
| Researcher Affiliation | Collaboration | 1 Beijing Jiaotong University {18118002,ytliu}@bjtu.edu.cn 2 Microsoft Research Asia {yuwang5,tyliu}@microsoft.com 3 Institute of Computing Technology, Chinese Academy of Sciences chenwei2022@ict.ac.cn 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences mazm@amt.ac.cn |
| Pseudocode | Yes | Algorithm 1 Directional Derivative Projection Policy Optimization |
| Open Source Code | Yes | Codes are available at https://github.com/CCreal/ddppo |
| Open Datasets | Yes | We evaluate our approach on six continuous control benchmark tasks in the MuJoCo (Todorov et al., 2012) simulator in our experiments: InvertedPendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2 and Humanoid-v2. |
| Dataset Splits | No | While the paper mentions and shows figures related to "validation" (e.g., "early stopping on a validation set" and "Predictive error on validation"), it does not explicitly provide specific details about the dataset splits (percentages, sample counts, or explicit standard split references) used for validation in their experiments, which would be needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not explicitly provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only refers to "MuJoCo simulator environments". |
| Software Dependencies | No | The paper mentions "MuJoCo simulator" but does not specify its version number. It does not list any other software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | Table 1 shows the hyperparameters used for DDPPO results shown in Figure 1. Environment name: InvertedPendulum, Hopper, Walker2D, HalfCheetah, Ant, Humanoid; epochs: 15, 100, 100, 100, 150, 150; environment steps/epoch: 1000; ensemble size: 7; G1/environment step: 10; G2/environment step: 10; H: 3, 2, 3; n: 10, 25, 25, 5; w: 10, 50, 0.1, 1.0, 0.1, 0.1 |
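
The Table 1 values quoted in the Experiment Setup row can be written out as a per-environment configuration. The sketch below is illustrative only and is not taken from the authors' repository: the dictionary keys and the helper names `build_configs` and `total_env_steps` are hypothetical, and it assumes that rows with a single value (environment steps/epoch, ensemble size, G1, G2) apply to all six environments. The rows for H and n list fewer than six values, so their mapping to environments cannot be recovered from the extracted text and they are left out.

```python
# Environment order as listed in the Table 1 header.
ENVIRONMENTS = [
    "InvertedPendulum", "Hopper", "Walker2D", "HalfCheetah", "Ant", "Humanoid",
]

# Rows that give one value per environment, in the order above.
EPOCHS = [15, 100, 100, 100, 150, 150]
W = [10, 50, 0.1, 1.0, 0.1, 0.1]

# Rows that give a single value; assumed (not confirmed by the source)
# to be shared by all environments. Key names are hypothetical.
SHARED = {
    "env_steps_per_epoch": 1000,
    "ensemble_size": 7,
    "G1_per_env_step": 10,
    "G2_per_env_step": 10,
}


def build_configs():
    """Return a dict mapping environment name to its hyperparameters."""
    configs = {}
    for name, epochs, w in zip(ENVIRONMENTS, EPOCHS, W):
        # H and n are omitted on purpose: their per-environment mapping
        # is ambiguous in the flattened table.
        configs[name] = {"epochs": epochs, "w": w, **SHARED}
    return configs


CONFIGS = build_configs()


def total_env_steps(name):
    """Total environment steps for one run: epochs * steps per epoch."""
    cfg = CONFIGS[name]
    return cfg["epochs"] * cfg["env_steps_per_epoch"]
```

Under these assumptions, `total_env_steps("Hopper")` is 100 epochs times 1000 steps per epoch, and the H and n values would need to be read from the paper's Table 1 before a run could be reproduced.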