Generalized Proximal Policy Optimization with Sample Reuse
Authors: James Queeney, Yannis Paschalidis, Christos G. Cassandras
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency. In addition to the theoretical support for our algorithm in the previous section, we aim to investigate the stability and sample efficiency of GePPO experimentally through simulations on several MuJoCo environments [21] in OpenAI Gym [3]. |
| Researcher Affiliation | Academia | James Queeney (Division of Systems Engineering, Boston University, jqueeney@bu.edu); Ioannis Ch. Paschalidis (Department of Electrical and Computer Engineering, Division of Systems Engineering, Boston University, yannisp@bu.edu); Christos G. Cassandras (Department of Electrical and Computer Engineering, Division of Systems Engineering, Boston University, cgc@bu.edu) |
| Pseudocode | Yes | Algorithm 1: Generalized Proximal Policy Optimization with Sample Reuse (GePPO). (An illustrative sketch of the generalized clipped objective appears after the table.) |
| Open Source Code | Yes | Code available at https://github.com/jqueeney/geppo. |
| Open Datasets | Yes | experimentally through simulations on several MuJoCo environments [21] in OpenAI Gym [3]. |
| Dataset Splits | No | The paper mentions batch sizes (e.g., 'the default batch size is N = 2,048') and sample collection processes. However, it does not specify explicit train/validation/test splits with percentages or counts, as is typical in supervised learning, because the data is generated online through interaction with the environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing instances. |
| Software Dependencies | No | The paper mentions software environments such as 'OpenAI Gym [3]' and 'MuJoCo environments [21]', but it does not provide specific version numbers for these or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | We represent the policy π as a multivariate Gaussian distribution, where the mean action for a given state is parameterized by a neural network with two hidden layers of 64 units each and tanh activations. The state-independent standard deviation is parameterized separately. The default value for the clipping parameter is ϵ_PPO = 0.2, and the default batch size is N = 2,048... The clipping parameter ϵ_GePPO is chosen according to Lemma 4, which in our experiments results in ϵ_GePPO = 0.1. (An illustrative policy-network sketch appears after the table.) |
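
The generalized clipping behind Algorithm 1 can be summarized in a short sketch. The PyTorch-style code below is an illustrative paraphrase of a GePPO-style clipped surrogate, in which importance ratios for off-policy samples are clipped around the current policy's ratio rather than around 1; it is not the authors' released implementation (see their GitHub repository for that), and the function name and argument names are assumptions.

```python
import torch

def geppo_surrogate(logp_new, logp_behav, logp_cur, advantages, eps=0.1):
    """Illustrative GePPO-style clipped surrogate loss (sketch, not the reference code).

    logp_new   -- log-probabilities under the policy being optimized
    logp_behav -- log-probabilities under the older behavior policy that collected the data
    logp_cur   -- log-probabilities under the current policy pi_k
    advantages -- advantage estimates for the sampled state-action pairs
    eps        -- GePPO clipping parameter (0.1 in the paper's experiments)
    """
    # Importance ratios taken with respect to the behavior policy.
    ratio_new = torch.exp(logp_new - logp_behav)
    ratio_cur = torch.exp(logp_cur - logp_behav)

    # Clip around the current policy's ratio; when the behavior policy is
    # the current policy, this reduces to standard PPO clipping around 1.
    clipped = torch.clamp(ratio_new, ratio_cur - eps, ratio_cur + eps)

    # Pessimistic (min) objective as in PPO; negated so it can be minimized.
    return -torch.minimum(ratio_new * advantages, clipped * advantages).mean()
```

Averaging this loss over minibatches drawn from trajectories collected by the last several policies is what enables the sample reuse studied in the paper; how those policies are weighted is a detail of the authors' implementation that is not reproduced here.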
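
The policy architecture quoted in the experiment setup (two hidden layers of 64 tanh units for the mean, with a state-independent standard deviation) corresponds roughly to the sketch below, assuming PyTorch; the class name and initialization choices are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: 64-64 tanh MLP for the mean, state-independent std."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # The log standard deviation is a free parameter, not a function of the state.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = torch.exp(self.log_std)
        return torch.distributions.Normal(mean, std)
```

Log-probabilities of sampled actions under this distribution (summed over action dimensions) are the quantities that would feed a clipped surrogate like the one sketched above.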