Generalized Proximal Policy Optimization with Sample Reuse

Authors: James Queeney, Yannis Paschalidis, Christos G. Cassandras

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate both theoretically and empirically that our algorithm delivers improved performance by effectively balancing the competing goals of stability and sample efficiency." "In addition to the theoretical support for our algorithm in the previous section, we aim to investigate the stability and sample efficiency of GePPO experimentally through simulations on several MuJoCo environments [21] in OpenAI Gym [3]."
Researcher Affiliation | Academia | James Queeney (Division of Systems Engineering, Boston University, jqueeney@bu.edu); Ioannis Ch. Paschalidis (Department of Electrical and Computer Engineering, Division of Systems Engineering, Boston University, yannisp@bu.edu); Christos G. Cassandras (Department of Electrical and Computer Engineering, Division of Systems Engineering, Boston University, cgc@bu.edu)
Pseudocode | Yes | Algorithm 1: Generalized Proximal Policy Optimization with Sample Reuse (GePPO). A hedged sketch of the generalized clipped objective appears after this table.
Open Source Code | Yes | Code available at https://github.com/jqueeney/geppo.
Open Datasets | Yes | "...experimentally through simulations on several MuJoCo environments [21] in OpenAI Gym [3]."
Dataset Splits | No | The paper mentions batch sizes (e.g., 'the default batch size is N = 2,048') and the sample collection procedure, but it does not specify explicit train/validation/test splits with percentages or counts, as is typical in supervised learning; instead, data is generated online through interaction with the environments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions software environments such as OpenAI Gym [3] and the MuJoCo environments [21], but it does not provide specific version numbers for these or any other software dependencies, which would be necessary for reproducibility.
Experiment Setup | Yes | "We represent the policy π as a multivariate Gaussian distribution, where the mean action for a given state is parameterized by a neural network with two hidden layers of 64 units each and tanh activations. The state-independent standard deviation is parameterized separately. The default value for the clipping parameter is ϵPPO = 0.2, and the default batch size is N = 2,048... The clipping parameter ϵGePPO is chosen according to Lemma 4, which in our experiments results in ϵGePPO = 0.1." A sketch of this policy parameterization follows the objective sketch below.
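
The paper's Algorithm 1 is given as pseudocode, and the authors' reference implementation is in the repository linked above. As a rough illustration of the central step, a clipped surrogate that reuses samples from the last several policies by clipping each importance ratio around the current policy's ratio rather than around 1, here is a minimal PyTorch sketch; the function name, inputs, and tensor conventions are our own assumptions, not the authors' API:

```python
import torch

def geppo_clipped_loss(logp_new, logp_old, logp_behavior, advantages, eps=0.1):
    """Hypothetical sketch of a GePPO-style surrogate loss (not the authors' code).

    logp_new:      log pi_theta(a|s) under the policy being optimized
    logp_old:      log pi_k(a|s) under the current policy, before the update
    logp_behavior: log pi_{k-i}(a|s) under the older policy that generated the sample
    advantages:    advantage estimates A^{pi_k}(s, a)
    eps:           clipping parameter (the paper reports eps_GePPO = 0.1)
    """
    # Importance ratio of the candidate policy w.r.t. the behavior policy.
    ratio = torch.exp(logp_new - logp_behavior)
    # Center the clipping interval at the current policy's ratio w.r.t. the
    # behavior policy; on-policy data gives a center of 1, recovering PPO's clip.
    center = torch.exp(logp_old - logp_behavior)
    clipped = torch.maximum(torch.minimum(ratio, center + eps), center - eps)
    # Pessimistic (min) surrogate, averaged over samples from the retained policies.
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))
```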
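
The experiment-setup description is detailed enough to sketch the policy network: a Gaussian policy whose mean is an MLP with two hidden layers of 64 tanh units and whose log standard deviation is a separate, state-independent parameter. The class below is a minimal reconstruction under those reported settings, not the authors' implementation; the class and attribute names are hypothetical:

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Minimal sketch of the policy architecture reported in the paper."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # Mean network: two hidden layers of 64 units with tanh activations.
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # State-independent log standard deviation, parameterized separately.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = torch.exp(self.log_std)
        return torch.distributions.Normal(mean, std)
```

Under the reported defaults, each update would draw a batch of N = 2,048 samples, with clipping parameter ϵPPO = 0.2 for PPO and ϵGePPO = 0.1 for GePPO.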