Batch size-invariance for policy optimization

Authors: Jacob Hilton, Karl Cobbe, John Schulman

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.
Researcher Affiliation | Industry | Jacob Hilton (OpenAI, jhilton@openai.com), Karl Cobbe (OpenAI, karl@openai.com), John Schulman (OpenAI, joschu@openai.com)
Pseudocode | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma.
Open Source Code | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma.
Open Datasets | Yes | To validate our analysis, we ran several experiments on Procgen Benchmark [Cobbe et al., 2019].
Dataset Splits | No | The paper uses environments from Procgen Benchmark to generate experience on the fly rather than relying on predefined static dataset splits for training, validation, and testing; specific split percentages or counts are not provided.
Hardware Specification | No | The type of resources used is proprietary information. The paper does not provide specific details regarding the hardware used for experiments.
Software Dependencies | No | The paper mentions software components like 'Adam' and 'PPG' but does not specify version numbers for any libraries, frameworks, or environments used in the experiments.
Experiment Setup | Yes | Hyperparameters for all of our experiments can be found in Appendix B, and full results on each of the individual environments can be found in Appendix F. ... More specifically, to achieve batch size-invariance for PPO and PPG-EWMA, we make the following adjustments to compensate for the optimization and iteration batch sizes being divided by some constant c: (1) Adjust the optimization hyperparameters as described in the previous section, i.e., divide the vanilla SGD learning rate by c or the Adam step size by √c. (We use Adam.) (2) Modify β_prox such that 1/(1 − β_prox) − 1 is multiplied by c. ... (3) If using advantage normalization, multiply the number of iterations used to estimate the advantage mean and variance by c. ... (4) For PPG, multiply the number of policy iterations per phase N by c.
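
To make the quoted batch size-invariance recipe concrete, here is a minimal Python sketch of the hyperparameter rescaling it describes. The function name and dictionary keys below are illustrative assumptions, not identifiers from the openai/ppo-ewma codebase; only the scaling rules themselves (learning rate ÷ c for SGD or ÷ √c for Adam, the β_prox adjustment, and the ×c factors for advantage normalization and PPG phase length) come from the quoted setup.

```python
import math


def rescale_hyperparams(hp: dict, c: float) -> dict:
    """Rescale PPO-EWMA / PPG-EWMA hyperparameters when the optimization and
    iteration batch sizes are divided by a factor c, following the adjustments
    quoted in the Experiment Setup row above. Keys are hypothetical."""
    new = dict(hp)

    # (1) Optimization hyperparameters: vanilla SGD learning rate scales as 1/c,
    #     while the Adam step size scales as 1/sqrt(c). The paper uses Adam.
    if hp.get("optimizer") == "sgd":
        new["learning_rate"] = hp["learning_rate"] / c
    else:  # Adam
        new["learning_rate"] = hp["learning_rate"] / math.sqrt(c)

    # (2) Proximal-policy EWMA decay: choose the new beta_prox so that
    #     1 / (1 - beta_prox) - 1 is multiplied by c.
    old_horizon = 1.0 / (1.0 - hp["beta_prox"]) - 1.0
    new["beta_prox"] = 1.0 - 1.0 / (c * old_horizon + 1.0)

    # (3) Advantage normalization: multiply the number of iterations used to
    #     estimate the advantage mean and variance by c.
    if "adv_norm_iters" in hp:
        new["adv_norm_iters"] = hp["adv_norm_iters"] * c

    # (4) PPG only: multiply the number of policy iterations per phase N by c.
    if "ppg_policy_iters_per_phase" in hp:
        new["ppg_policy_iters_per_phase"] = hp["ppg_policy_iters_per_phase"] * c

    return new
```

For example, halving the batch sizes (c = 2) with Adam divides the step size by √2 ≈ 1.41, and moves β_prox = 0.9 (an EWMA horizon of 1/(1 − 0.9) − 1 = 9 iterations) to β_prox ≈ 0.947 (horizon 18), so that the proximal policy averages over the same amount of data as before.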