Gradient Informed Proximal Policy Optimization
Authors: Sanghyun Son, Laura Zheng, Ryan Sullivan, Yi-Ling Qiao, Ming Lin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results that show our method's efficacy for various optimization and complex control problems. To validate our approach, we tested various baseline methods on the environments that we use. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park |
| Pseudocode | Yes | In Algorithm 1, we present pseudocode that illustrates the outline of our algorithm, GI-PPO. |
| Open Source Code | Yes | Our code can be found online: https://github.com/SonSang/gippo. |
| Open Datasets | Yes | We used Cartpole, Ant, and Hopper environments implemented by [Xu et al., 2022] for comparisons. We use De Jong's function and Ackley's function for comparison, as they are popular functions for testing numerical optimization algorithms [Molga and Smutnicki, 2005]. In this paper, we use the pace car problem, where a single autonomous pace car has to control the speed of the other vehicles via interference. The number of lanes, which represent the discontinuities in gradients, and the number of following human vehicles are different for each problem. Please see Appendix 7.5.2 for the details of this environment. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes the experience collection process for RL, but not fixed dataset partitioning. |
| Hardware Specification | Yes | As for hardware, all experiments were run with an Intel Xeon W-2255 CPU @ 3.70GHz, one NVIDIA RTX A5000 graphics card, and 16 GB of memory. |
| Software Dependencies | Yes | We have implemented our learning method using PyTorch 1.9 [Paszke et al., 2019]. |
| Experiment Setup | Yes | In this section, we provide network architectures and hyperparameters that we used for experiments in Section 5. For each of the experiments, we used the same network architectures, the same length of time horizons before policy update, and the same optimization procedure for critic updates, etc. |
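
The Pseudocode row above refers to Algorithm 1 of the paper, which builds on standard PPO. For orientation only, the sketch below shows the clipped surrogate loss that vanilla PPO minimizes; it is not the authors' GI-PPO implementation, and the function name `ppo_clip_loss` and its arguments are illustrative assumptions rather than identifiers from the released code.

```python
# Minimal sketch of the standard PPO clipped surrogate loss (not the authors'
# GI-PPO implementation). Names and shapes are illustrative assumptions.
import torch

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(log_prob_new - log_prob_old)
    # Clipped surrogate: take the pessimistic (minimum) of the unclipped and
    # clipped objectives, then negate so it can be minimized with SGD/Adam.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```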
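
The optimization benchmarks quoted in the Open Datasets row, De Jong's (sphere) function and Ackley's function, have standard textbook forms. The sketch below gives differentiable PyTorch versions under that assumption; the exact domains, scaling, and reward shaping used in the paper's environments may differ.

```python
# Differentiable versions of the two classic test functions named in the table
# (De Jong's sphere function and Ackley's function). Constants follow the
# commonly used textbook forms; the paper's exact setup may differ.
import math
import torch

def de_jong(x: torch.Tensor) -> torch.Tensor:
    # De Jong's first (sphere) function: f(x) = sum_i x_i^2, minimum 0 at x = 0.
    return (x ** 2).sum(dim=-1)

def ackley(x: torch.Tensor,
           a: float = 20.0,
           b: float = 0.2,
           c: float = 2.0 * math.pi) -> torch.Tensor:
    # Ackley's function: highly multimodal, global minimum 0 at x = 0.
    term1 = -a * torch.exp(-b * torch.sqrt((x ** 2).mean(dim=-1)))
    term2 = -torch.exp(torch.cos(c * x).mean(dim=-1))
    return term1 + term2 + a + math.e

# Analytical gradients (the kind of signal a gradient-informed method can use)
# come for free from autograd:
x = torch.randn(5, requires_grad=True)
ackley(x).backward()
print(x.grad)
```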