Gradient Informed Proximal Policy Optimization
Authors: Sanghyun Son, Laura Zheng, Ryan Sullivan, Yi-Ling Qiao, Ming Lin
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present experimental results that show our method's efficacy for various optimization and complex control problems. To validate our approach, we tested various baseline methods on the environments that we use. |
| Researcher Affiliation | Academia | Department of Computer Science, University of Maryland, College Park |
| Pseudocode | Yes | In Algorithm 1, we present pseudocode that illustrates the outline of our algorithm, GI-PPO. |
| Open Source Code | Yes | Our code can be found online: https://github.com/SonSang/gippo. |
| Open Datasets | Yes | We used Cartpole, Ant, and Hopper environments implemented by [Xu et al., 2022] for comparisons. We use De Jong's function and Ackley's function for comparison, as they are popular functions for testing numerical optimization algorithms [Molga and Smutnicki, 2005]. In this paper, we use the pace car problem, where a single autonomous pace car has to control the speed of the other vehicles via interference. The number of lanes, which represent the discontinuities in gradients, and the number of following human vehicles are different for each problem. Please see Appendix 7.5.2 for the details of this environment. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits. It describes the experience collection process for RL, but not fixed dataset partitioning. |
| Hardware Specification | Yes | As for hardware, all experiments were run with an Intel Xeon W-2255 CPU @ 3.70GHz, one NVIDIA RTX A5000 graphics card, and 16 GB of memory. |
| Software Dependencies | Yes | We have implemented our learning method using PyTorch 1.9 [Paszke et al., 2019]. |
| Experiment Setup | Yes | In this section, we provide network architectures and hyperparameters that we used for experiments in Section 5. For each of the experiments, we used the same network architectures, the same length of time horizons before policy update, and the same optimization procedure for critic updates, etc. |
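
The Pseudocode row above refers to Algorithm 1 of the paper, which builds on standard PPO. For orientation only, the sketch below shows the clipped surrogate loss that vanilla PPO minimizes; it is not the authors' GI-PPO implementation, and the function name `ppo_clip_loss` and its arguments are illustrative assumptions rather than identifiers from the released code.

```python
# Minimal sketch of the standard PPO clipped surrogate loss (not the authors'
# GI-PPO implementation). Names and shapes are illustrative assumptions.
import torch

def ppo_clip_loss(log_prob_new: torch.Tensor,
                  log_prob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(log_prob_new - log_prob_old)
    # Clipped surrogate: take the pessimistic (minimum) of the unclipped and
    # clipped objectives, then negate so it can be minimized with SGD/Adam.
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```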
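
The optimization benchmarks quoted in the Open Datasets row, De Jong's (sphere) function and Ackley's function, have standard textbook forms. The sketch below gives differentiable PyTorch versions under that assumption; the exact domains, scaling, and reward shaping used in the paper's environments may differ.

```python
# Differentiable versions of the two classic test functions named in the table
# (De Jong's sphere function and Ackley's function). Constants follow the
# commonly used textbook forms; the paper's exact setup may differ.
import math
import torch

def de_jong(x: torch.Tensor) -> torch.Tensor:
    # De Jong's first (sphere) function: f(x) = sum_i x_i^2, minimum 0 at x = 0.
    return (x ** 2).sum(dim=-1)

def ackley(x: torch.Tensor,
           a: float = 20.0,
           b: float = 0.2,
           c: float = 2.0 * math.pi) -> torch.Tensor:
    # Ackley's function: highly multimodal, global minimum 0 at x = 0.
    term1 = -a * torch.exp(-b * torch.sqrt((x ** 2).mean(dim=-1)))
    term2 = -torch.exp(torch.cos(c * x).mean(dim=-1))
    return term1 + term2 + a + math.e

# Analytical gradients (the kind of signal a gradient-informed method can use)
# come for free from autograd:
x = torch.randn(5, requires_grad=True)
ackley(x).backward()
print(x.grad)
```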