Trust Region-Guided Proximal Policy Optimization

Authors: Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify the advantage of the proposed method. We conducted experiments to answer the following questions: (1) Does PPO suffer from the lack of exploration issue? (2) Could our TRGPPO relieve the exploration issue and improve sample efficiency compared to PPO? (3) Does our TRGPPO maintain the stable learning property of PPO? To answer these questions, we first evaluate the algorithms on two simple bandit problems and then compare them on high-dimensional benchmark tasks.
Researcher Affiliation | Academia | College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; Collaborative Innovation Center of Novel Software Technology and Industrialization. {y.wang, hugo, x.tan, yzgancn}@nuaa.edu.cn
Pseudocode | Yes | Algorithm 1: Simplified Policy Iteration with PPO (a hedged sketch of the clipped surrogate objective it builds on appears after this table).
Open Source Code | Yes | Source code is available at https://github.com/wangyuhuix/TRGPPO.
Open Datasets | Yes | We evaluate algorithms on benchmark tasks implemented in OpenAI Gym [2], simulated by MuJoCo [21] and the Arcade Learning Environment [1] (a minimal environment-rollout sketch follows the table).
Dataset Splits | No | The paper mentions training policies and evaluating them after certain timesteps, but it does not specify explicit training, validation, or test dataset splits in terms of percentages or counts for model training.
Hardware Specification | Yes | All the wall-clock time reported in Section 6 is obtained on a server with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 40 CPUs, 256GB of memory and one GeForce GTX 1080 Ti GPU.
Software Dependencies | No | The paper states that implementations are 'based on OpenAI Baselines [4]', but it does not specify version numbers for this framework or for other software dependencies such as Python, TensorFlow, or PyTorch, which would be necessary for full reproducibility.
Experiment Setup | Yes | For our TRGPPO, the trust region coefficient δ is adaptively set by tuning ϵ (see Appendix B.4 for more detail). We set ϵ = 0.2, the same as PPO. All tasks were run for 1 million timesteps, except that the Humanoid task was run for 20 million timesteps. The trained policies are evaluated after every 2048 timesteps of sampled data (a configuration sketch follows the table).
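
For the pseudocode entry: the quoted Algorithm 1 builds on the PPO clipped surrogate objective, which TRGPPO modifies by replacing the fixed clipping range with one derived from a KL trust region of coefficient δ. The sketch below shows only the standard clipped loss with a configurable range; the function name and the NumPy formulation are illustrative assumptions, and the adaptive range computation itself lives in the authors' repository, not here.

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps_low, eps_high):
    """Clipped surrogate objective (negated, so lower is better).

    PPO uses a fixed range [1 - eps, 1 + eps]; TRGPPO replaces it with a
    state-action-dependent range derived from a KL trust region (omitted here).
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # PPO takes the elementwise minimum of the two terms (a pessimistic bound).
    return -np.mean(np.minimum(unclipped, clipped))

# Example: probability ratios pi_new / pi_old and advantage estimates.
ratio = np.array([0.9, 1.1, 1.4])
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_loss(ratio, adv, eps_low=0.2, eps_high=0.2))  # PPO with eps = 0.2
```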
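
For the open-datasets entry: the benchmarks are OpenAI Gym tasks simulated by MuJoCo and the Arcade Learning Environment. A minimal rollout sketch under the classic Gym API is shown below; the environment id `Hopper-v2` and the four-value `step()` return signature are assumptions tied to Gym versions contemporary with the paper.

```python
import gym  # requires a Gym build with MuJoCo support

# Environment id is an assumption; the paper evaluates several MuJoCo tasks.
env = gym.make("Hopper-v2")

obs = env.reset()  # classic Gym API: reset() returns only the observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy stand-in
    obs, reward, done, info = env.step(action)   # four-value return in older Gym
    total_reward += reward
print("episode return:", total_reward)
env.close()
```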
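
For the experiment-setup entry: a hedged sketch of how the quoted settings (ϵ = 0.2, 1 million timesteps per task, evaluation every 2048 sampled timesteps) might be collected into a run configuration. The dictionary keys and the loop skeleton are illustrative assumptions, not the authors' Baselines-based code.

```python
# Hyperparameters quoted from the paper; the key names are illustrative.
config = {
    "epsilon": 0.2,                  # clipping parameter, same value as PPO
    "total_timesteps": 1_000_000,    # 20_000_000 for the Humanoid task
    "timesteps_per_batch": 2048,     # policy evaluated after each 2048-step batch
}

def train(cfg):
    """Skeleton loop: sample a batch, update the policy, then evaluate."""
    timesteps = 0
    while timesteps < cfg["total_timesteps"]:
        # collect cfg["timesteps_per_batch"] environment steps (omitted)
        # update the policy with the clipped surrogate loss (omitted)
        timesteps += cfg["timesteps_per_batch"]
        # evaluate the trained policy here, as described in the paper
    return timesteps

print("timesteps consumed:", train(config))
```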