Gradient Informed Proximal Policy Optimization

Authors: Sanghyun Son, Laura Zheng, Ryan Sullivan, Yi-Ling Qiao, Ming Lin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we present experimental results that show our method's efficacy for various optimization and complex control problems. To validate our approach, we tested various baseline methods on the environments that we use."
Researcher Affiliation | Academia | "Department of Computer Science, University of Maryland, College Park"
Pseudocode | Yes | "In Algorithm 1, we present pseudocode that illustrates the outline of our algorithm, GI-PPO." (A minimal sketch of the standard PPO clipped objective that GI-PPO builds on appears after this table.)
Open Source Code | Yes | "Our code can be found online: https://github.com/SonSang/gippo."
Open Datasets | Yes | "We used Cartpole, Ant, and Hopper environments implemented by [Xu et al., 2022] for comparisons. We use De Jong's function and Ackley's function for comparison, as they are popular functions for testing numerical optimization algorithms [Molga and Smutnicki, 2005]. In this paper, we use the pace car problem, where a single autonomous pace car has to control the speed of the other vehicles via interference. The number of lanes, which represent the discontinuities in gradients, and the number of following human vehicles are different for each problem. Please see Appendix 7.5.2 for the details of this environment." (Reference definitions of the two test functions are sketched after this table.)
Dataset Splits | No | "The paper does not explicitly provide training/validation/test dataset splits. It describes the experience collection process for RL, but no fixed dataset partitioning."
Hardware Specification | Yes | "As for hardware, all experiments were run with an Intel Xeon W-2255 CPU @ 3.70GHz, one NVIDIA RTX A5000 graphics card, and 16 GB of memory."
Software Dependencies | Yes | "We have implemented our learning method using PyTorch 1.9 [Paszke et al., 2019]."
Experiment Setup | Yes | "In this section, we provide network architectures and hyperparameters that we used for experiments in Section 5. For each of the experiments, we used the same network architectures, the same length of time horizons before policy update, and the same optimization procedure for critic updates, etc."
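
For context on the Pseudocode row: GI-PPO is built on top of PPO's clipped surrogate objective. The snippet below is a minimal sketch of that standard objective only, written in PyTorch; it is not the authors' Algorithm 1 (which additionally incorporates analytical gradient information), and the names `ppo_clip_loss`, `clip_eps`, and the tensor arguments are illustrative assumptions.

```python
# Minimal sketch of the standard PPO clipped-surrogate loss that GI-PPO builds on.
# NOT the authors' Algorithm 1; names and defaults are assumptions for illustration.
import torch


def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate policy loss from standard PPO (Schulman et al., 2017)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate; return its negative for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Under these assumptions, a policy update would compute `loss = ppo_clip_loss(logp_new, logp_old, adv)`, call `loss.backward()`, and step an optimizer; the gradient-informed parts of GI-PPO are omitted here.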
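
For the Open Datasets row: De Jong's (sphere) function and Ackley's function are standard numerical-optimization test functions. The sketch below gives their commonly used definitions in PyTorch so they remain differentiable; the exact variants, dimensionalities, and search ranges used in the paper are not specified here and may differ.

```python
# Commonly used forms of the two test functions named in the table.
# Written in PyTorch so gradients can flow through them; assumptions, not the
# paper's exact configuration.
import math
import torch


def de_jong(x: torch.Tensor) -> torch.Tensor:
    """De Jong's first (sphere) function: sum of squares, global minimum 0 at x = 0."""
    return (x ** 2).sum(dim=-1)


def ackley(x: torch.Tensor) -> torch.Tensor:
    """Ackley's function with the usual constants a=20, b=0.2, c=2*pi; minimum 0 at x = 0."""
    sq_term = -20.0 * torch.exp(-0.2 * torch.sqrt((x ** 2).mean(dim=-1)))
    cos_term = -torch.exp(torch.cos(2.0 * math.pi * x).mean(dim=-1))
    return sq_term + cos_term + 20.0 + math.e
```

For example, `ackley(torch.zeros(3))` evaluates to (numerically) zero, the global minimum, and both functions accept a trailing batch dimension via `dim=-1`.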