Trust Region-Guided Proximal Policy Optimization
Authors: Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments verify the advantage of the proposed method. We conducted experiments to answer the following questions: (1) Does PPO suffer from the lack of exploration issue? (2) Could our TRGPPO relieve the exploration issue and improve sample efficiency compared to PPO? (3) Does our TRGPPO maintain the stable learning property of PPO? To answer these questions, we first evaluate the algorithms on two simple bandit problems and then compare them on high-dimensional benchmark tasks. |
| Researcher Affiliation | Academia | College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; Collaborative Innovation Center of Novel Software Technology and Industrialization. {y.wang, hugo, x.tan, yzgancn}@nuaa.edu.cn |
| Pseudocode | Yes | Algorithm 1 Simplified Policy Iteration with PPO (a hedged sketch of the clipped-surrogate update it builds on appears after the table). |
| Open Source Code | Yes | Source code is available at https://github.com/wangyuhuix/TRGPPO. |
| Open Datasets | Yes | We evaluate algorithms on benchmark tasks implemented in OpenAI Gym [2], simulated by MuJoCo [21] and the Arcade Learning Environment [1]. |
| Dataset Splits | No | The paper mentions training policies and evaluating them after certain timesteps, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or counts for model training. |
| Hardware Specification | Yes | All the wall-clock time reported in Section 6 is obtained on a server with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 40 CPUs, 256GB of memory and one GeForce GTX 1080 Ti GPU. |
| Software Dependencies | No | The paper states that implementations are 'based on OpenAI Baselines [4]', but it does not specify explicit version numbers for this framework or any other software dependencies like Python, TensorFlow, or PyTorch, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | For our TRGPPO, the trust region coefficient δ is adaptively set by tuning ϵ (see Appendix B.4 for more detail). We set ϵ = 0.2, same as PPO. All tasks were run with 1 million timesteps, except that the Humanoid task was run for 20 million timesteps. The trained policies are evaluated after sampling every 2048 timesteps of data (these settings are collected in the configuration sketch after the table). |
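The Pseudocode row refers to Algorithm 1, which iterates policy evaluation with PPO's clipped-surrogate update; TRGPPO keeps that update but replaces the fixed clipping range with a per-state-action range derived from a KL trust region with coefficient δ. The sketch below is a minimal illustration of that update step, assuming NumPy; the function and variable names are illustrative, and the trust-region-derived bounds are left as an input rather than computed as in the paper, so this is not the authors' released implementation.

```python
import numpy as np

def clipped_surrogate_loss(ratio, advantage, epsilon=0.2, lower=None, upper=None):
    """Negative clipped surrogate objective (to be minimized).

    ratio:       pi_new(a|s) / pi_old(a|s) for each sampled (s, a), shape (batch,)
    advantage:   advantage estimates, shape (batch,)
    epsilon:     PPO clipping parameter (0.2 in the paper's experiments)
    lower/upper: optional per-sample clipping bounds; vanilla PPO uses the fixed
                 [1 - epsilon, 1 + epsilon], while TRGPPO would pass in bounds
                 derived from its KL trust region (not computed in this sketch).
    """
    if lower is None:
        lower = np.full_like(ratio, 1.0 - epsilon)
    if upper is None:
        upper = np.full_like(ratio, 1.0 + epsilon)
    clipped = np.clip(ratio, lower, upper)
    return -np.mean(np.minimum(ratio * advantage, clipped * advantage))

# Toy batch: first with the fixed PPO range, then with a (made-up) wider adaptive range.
ratio = np.array([0.9, 1.1, 1.5, 0.6])
advantage = np.array([1.0, -0.5, 2.0, 0.3])
print(clipped_surrogate_loss(ratio, advantage))
print(clipped_surrogate_loss(ratio, advantage,
                             lower=np.array([0.70, 0.75, 0.80, 0.70]),
                             upper=np.array([1.45, 1.30, 1.25, 1.50])))
```

In the paper, this wider, state-action-dependent clipping range is what relieves PPO's exploration issue while preserving its stable learning behavior.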
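The Experiment Setup row lists a handful of concrete settings; gathering them into a configuration dictionary, as sketched below, is one way a reproduction might record them. The key names are assumptions for illustration and are not taken from the released TRGPPO code; only the values come from the paper.

```python
# Illustrative configuration mirroring the quoted experiment setup.
EXPERIMENT_SETUP = {
    "epsilon": 0.2,                  # clipping parameter, shared by PPO and TRGPPO
    "delta": "set adaptively by tuning epsilon (Appendix B.4)",
    "total_timesteps": 1_000_000,    # 20_000_000 for the Humanoid task
    "eval_every_timesteps": 2048,    # policies evaluated after each 2048-timestep batch
}
```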