Trust Region-Guided Proximal Policy Optimization

Authors: Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the advantage of the proposed method. We conducted experiments to answer the following questions: (1) Does PPO suffer from the lack of exploration issue? (2) Could our TRGPPO relieve the exploration issue and improve sample efficiency compared to PPO? (3) Does our TRGPPO maintain the stable learning property of PPO? To answer these questions, we first evaluate the algorithms on two simple bandit problems and then compare them on high-dimensional benchmark tasks.
Researcher Affiliation | Academia | College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; Collaborative Innovation Center of Novel Software Technology and Industrialization. {y.wang, hugo, x.tan, yzgancn}@nuaa.edu.cn
Pseudocode | Yes | Algorithm 1: Simplified Policy Iteration with PPO (a hedged sketch of the clipped surrogate objective it builds on appears after this table).
Open Source Code | Yes | Source code is available at https://github.com/wangyuhuix/TRGPPO.
Open Datasets | Yes | We evaluate algorithms on benchmark tasks implemented in OpenAI Gym [2], simulated by MuJoCo [21] and the Arcade Learning Environment [1] (a minimal environment-rollout sketch follows the table).
Dataset Splits | No | The paper mentions training policies and evaluating them after certain timesteps, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or counts for model training.
Hardware Specification | Yes | All the wall-clock time reported in Section 6 is obtained on a server with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 40 CPUs, 256GB of memory and one GeForce GTX 1080 Ti GPU.
Software Dependencies | No | The paper states that implementations are 'based on OpenAI Baselines [4]', but it does not specify version numbers for this framework or for other software dependencies such as Python, TensorFlow, or PyTorch, which would be necessary for full reproducibility.
Experiment Setup | Yes | For our TRGPPO, the trust region coefficient δ is adaptively set by tuning ϵ (see Appendix B.4 for more detail). We set ϵ = 0.2, the same as PPO. All tasks were run for 1 million timesteps, except that the Humanoid task was run for 20 million timesteps. The trained policies are evaluated after every 2048 timesteps of sampled data (a configuration sketch follows the table).
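
For the pseudocode entry: the quoted Algorithm 1 builds on the PPO clipped surrogate objective, which TRGPPO modifies by replacing the fixed clipping range with one derived from a KL trust region of coefficient δ. The sketch below shows only the standard clipped loss with a configurable range; the function name and the NumPy formulation are illustrative assumptions, and the adaptive range computation itself lives in the authors' repository, not here.

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps_low, eps_high):
    """Clipped surrogate objective (negated, so lower is better).

    PPO uses a fixed range [1 - eps, 1 + eps]; TRGPPO replaces it with a
    state-action-dependent range derived from a KL trust region (omitted here).
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # PPO takes the elementwise minimum of the two terms (a pessimistic bound).
    return -np.mean(np.minimum(unclipped, clipped))

# Example: probability ratios pi_new / pi_old and advantage estimates.
ratio = np.array([0.9, 1.1, 1.4])
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_loss(ratio, adv, eps_low=0.2, eps_high=0.2))  # PPO with eps = 0.2
```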
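
For the open-datasets entry: the benchmarks are OpenAI Gym tasks simulated by MuJoCo and the Arcade Learning Environment. A minimal rollout sketch under the classic Gym API is shown below; the environment id `Hopper-v2` and the four-value `step()` return signature are assumptions tied to Gym versions contemporary with the paper.

```python
import gym  # requires a Gym build with MuJoCo support

# Environment id is an assumption; the paper evaluates several MuJoCo tasks.
env = gym.make("Hopper-v2")

obs = env.reset()  # classic Gym API: reset() returns only the observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy stand-in
    obs, reward, done, info = env.step(action)   # four-value return in older Gym
    total_reward += reward
print("episode return:", total_reward)
env.close()
```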
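
For the experiment-setup entry: a hedged sketch of how the quoted settings (ϵ = 0.2, 1 million timesteps per task, evaluation every 2048 sampled timesteps) might be collected into a run configuration. The dictionary keys and the loop skeleton are illustrative assumptions, not the authors' Baselines-based code.

```python
# Hyperparameters quoted from the paper; the key names are illustrative.
config = {
    "epsilon": 0.2,                  # clipping parameter, same value as PPO
    "total_timesteps": 1_000_000,    # 20_000_000 for the Humanoid task
    "timesteps_per_batch": 2048,     # policy evaluated after each 2048-step batch
}

def train(cfg):
    """Skeleton loop: sample a batch, update the policy, then evaluate."""
    timesteps = 0
    while timesteps < cfg["total_timesteps"]:
        # collect cfg["timesteps_per_batch"] environment steps (omitted)
        # update the policy with the clipped surrogate loss (omitted)
        timesteps += cfg["timesteps_per_batch"]
        # evaluate the trained policy here, as described in the paper
    return timesteps

print("timesteps consumed:", train(config))
```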