Gradient Information Matters in Policy Optimization by Back-propagating through Model

Authors: Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu

ICLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental Finally, we empirically demonstrate the proposed algorithm has better sample efficiency when achieving a comparable or better performance on benchmark continuous control tasks. Codes are available at https://github.com/CCreal/ddppo
Researcher Affiliation Collaboration 1 Beijing Jiaotong University {18118002,ytliu}@bjtu.edu.cn 2 Microsoft Research Asia {yuwang5,tyliu}@microsoft.com 3 Institute of Computing Technology, Chinese Academy of Sciences chenwei2022@ict.ac.cn 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences mazm@amt.ac.cn
Pseudocode Yes Algorithm 1 Directional Derivative Projection Policy Optimization
Open Source Code Yes Codes are available at https://github.com/CCreal/ddppo
Open Datasets Yes We evaluate our approach on six continuous control benchmark tasks in the MuJoCo (Todorov et al., 2012) simulator in our experiments: InvertedPendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2 and Humanoid-v2.
Dataset Splits No The paper mentions validation (e.g., "early stopping on a validation set" and "Predictive error on validation") and shows related figures, but it does not give the split details (percentages, sample counts, or references to a standard split) needed to reproduce the data partitioning.
Hardware Specification No The paper does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory. It refers only to "MuJoCo simulator environments".
Software Dependencies No The paper mentions the "MuJoCo simulator" but does not give its version number. It lists no other software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions) that would be needed to replicate the experiments.
Experiment Setup Yes Table 1 shows the hyperparameters used for DDPPO results shown in Figure 1.
Environment Name (column order): InvertedPendulum, Hopper, Walker2D, HalfCheetah, Ant, Humanoid
epochs: 15, 100, 100, 100, 150, 150
environment steps / epoch: 1000
ensemble size: 7
G1 / environment step: 10
G2 / environment step: 10
H: 3, 2, 3
n: 10, 25, 25, 5
w: 10, 50, 0.1, 1.0, 0.1, 0.1
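The per-environment settings above can be captured in a small configuration sketch, which also makes the implied training budget (epochs times environment steps per epoch) explicit. This is an illustration only and is not taken from the authors' repository: the names `ENVIRONMENTS`, `SHARED`, `build_configs`, and all dictionary keys are hypothetical. H and n are left out because the table lists fewer values than environments for them, so their per-environment assignment is not stated here.

```python
# Hyperparameters transcribed from the table above.
# All identifiers and key names are hypothetical, chosen for illustration.
ENVIRONMENTS = ["InvertedPendulum", "Hopper", "Walker2D", "HalfCheetah", "Ant", "Humanoid"]
EPOCHS = [15, 100, 100, 100, 150, 150]   # one value per environment
W = [10, 50, 0.1, 1.0, 0.1, 0.1]         # one value per environment

# Settings listed once in the table, assumed here to be shared by all environments.
SHARED = {
    "env_steps_per_epoch": 1000,
    "ensemble_size": 7,
    "g1_per_env_step": 10,
    "g2_per_env_step": 10,
}


def build_configs():
    """Return a dict mapping environment name to its hyperparameter dict."""
    configs = {}
    for name, epochs, w in zip(ENVIRONMENTS, EPOCHS, W):
        cfg = dict(SHARED)  # copy so each environment gets its own dict
        cfg["epochs"] = epochs
        cfg["w"] = w
        # Derived quantity: total environment interactions over training.
        cfg["total_env_steps"] = epochs * SHARED["env_steps_per_epoch"]
        configs[name] = cfg
    return configs


if __name__ == "__main__":
    for name, cfg in build_configs().items():
        print(f"{name}: {cfg['total_env_steps']} environment steps in total")
```

Under these settings the derived budget ranges from 15 × 1000 environment steps for InvertedPendulum to 150 × 1000 for Ant and Humanoid.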