Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Gradient Information Matters in Policy Optimization by Back-propagating through Model
Authors: Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically demonstrate the proposed algorithm has better sample efficiency when achieving a comparable or better performance on benchmark continuous control tasks. Codes are available at https://github.com/CCreal/ddppo |
| Researcher Affiliation | Collaboration | 1 Beijing Jiaotong University; 2 Microsoft Research Asia; 3 Institute of Computing Technology, Chinese Academy of Sciences; 4 Academy of Mathematics and Systems Science, Chinese Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 Directional Derivative Projection Policy Optimization |
| Open Source Code | Yes | Codes are available at https://github.com/CCreal/ddppo |
| Open Datasets | Yes | We evaluate our approach on six continuous control benchmark tasks in the MuJoCo (Todorov et al., 2012) simulator in our experiments: InvertedPendulum-v2, Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2 and Humanoid-v2. |
| Dataset Splits | No | Although the paper mentions validation and shows related figures (e.g., "early stopping on a validation set" and "Predictive error on validation"), it does not specify the dataset splits (percentages, sample counts, or references to standard splits) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory. It refers only to "MuJoCo simulator environments". |
| Software Dependencies | No | The paper mentions the "MuJoCo simulator" but does not give its version number. It lists no other software dependencies with version numbers (e.g., Python, PyTorch, CUDA) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | Table 1 shows the hyperparameters used for DDPPO results shown in Figure 1. Environment name: InvertedPendulum, Hopper, Walker2D, HalfCheetah, Ant, Humanoid. Epochs: 15, 100, 100, 100, 150, 150. Environment steps per epoch: 1000. Ensemble size: 7. G1 per environment step: 10. G2 per environment step: 10. H: 3, 2, 3. n: 10, 25, 25, 5. w: 10, 50, 0.1, 1.0, 0.1, 0.1. |
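
The hyperparameters in the Experiment Setup row can be collected into a per-environment configuration, which also makes the implied training budget easy to compute. The following is a minimal sketch, not the authors' code: the dictionary keys and the `build_configs` helper are hypothetical names, and the sketch assumes the six `epochs` and `w` values follow the order of the environment names. `H` and `n` are left out because the row gives fewer values than environments, so their column mapping cannot be recovered here.

```python
# Values transcribed from the Experiment Setup row (Table 1 of the paper).
# Assumption: per-environment values are listed in the same order as the names.
ENVIRONMENTS = ["InvertedPendulum", "Hopper", "Walker2D", "HalfCheetah", "Ant", "Humanoid"]
EPOCHS = [15, 100, 100, 100, 150, 150]
W = [10, 50, 0.1, 1.0, 0.1, 0.1]

# Settings shared by all six environments (key names are illustrative).
SHARED = {
    "env_steps_per_epoch": 1000,
    "ensemble_size": 7,
    "g1_per_env_step": 10,
    "g2_per_env_step": 10,
}


def build_configs():
    """Return one config dict per environment, with derived step budgets."""
    configs = {}
    for name, epochs, w in zip(ENVIRONMENTS, EPOCHS, W):
        total_env_steps = epochs * SHARED["env_steps_per_epoch"]
        configs[name] = {
            **SHARED,
            "epochs": epochs,
            "w": w,
            # Derived quantities: plain arithmetic on the table values.
            "total_env_steps": total_env_steps,
            "total_g1_updates": total_env_steps * SHARED["g1_per_env_step"],
            "total_g2_updates": total_env_steps * SHARED["g2_per_env_step"],
        }
    return configs


if __name__ == "__main__":
    for env_name, cfg in build_configs().items():
        print(env_name, cfg["epochs"], cfg["total_env_steps"])
```

Under these assumptions the environment-step budget ranges from 15 × 1000 = 15,000 steps for InvertedPendulum to 150 × 1000 = 150,000 steps for Ant and Humanoid.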