Batch size-invariance for policy optimization
Authors: Jacob Hilton, Karl Cobbe, John Schulman
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data. |
| Researcher Affiliation | Industry | Jacob Hilton, OpenAI (jhilton@openai.com); Karl Cobbe, OpenAI (karl@openai.com); John Schulman, OpenAI (joschu@openai.com) |
| Pseudocode | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma. |
| Open Source Code | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma. |
| Open Datasets | Yes | To validate our analysis, we ran several experiments on Procgen Benchmark [Cobbe et al., 2019] |
| Dataset Splits | No | The paper uses environments from Procgen Benchmark for experiments, which are used to generate experience on the fly rather than relying on predefined static dataset splits for training, validation, and testing. Specific dataset split percentages or counts are not provided. |
| Hardware Specification | No | The type of resources used is proprietary information. The paper does not provide specific details regarding the hardware used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam' and 'PPG' but does not specify version numbers for any libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | Hyperparameters for all of our experiments can be found in Appendix B, and full results on each of the individual environments can be found in Appendix F. ... More specifically, to achieve batch size-invariance for PPO-EWMA and PPG-EWMA, we make the following adjustments to compensate for the optimization and iteration batch sizes being divided by some constant c: adjust the optimization hyperparameters as described in the previous section, i.e., divide the vanilla SGD learning rate by c or the Adam step size by √c (we use Adam); modify β_prox such that 1/(1 − β_prox) − 1 is multiplied by c; ... if using advantage normalization, multiply the number of iterations used to estimate the advantage mean and variance by c; ... for PPG, multiply the number of policy iterations per phase N by c. |
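
The adjustments quoted in the Experiment Setup row amount to a simple rescaling of hyperparameters when the batch sizes are divided by a constant c. The sketch below illustrates that rescaling in Python under stated assumptions; the function name, dictionary keys, and example values are hypothetical and are not taken from the paper or the openai/ppo-ewma code.

```python
import math


def rescale_hyperparams(hp: dict, c: float, optimizer: str = "adam") -> dict:
    """Hypothetical sketch of the batch size-invariance adjustments when the
    optimization and iteration batch sizes are divided by c."""
    new = dict(hp)

    # 1. Optimizer step size: divide the vanilla SGD learning rate by c,
    #    or the Adam step size by sqrt(c).
    if optimizer == "sgd":
        new["learning_rate"] = hp["learning_rate"] / c
    else:  # Adam
        new["learning_rate"] = hp["learning_rate"] / math.sqrt(c)

    # 2. Proximal policy EWMA decay: modify beta_prox so that
    #    1 / (1 - beta_prox) - 1 is multiplied by c.
    horizon = 1.0 / (1.0 - hp["beta_prox"]) - 1.0
    new["beta_prox"] = 1.0 - 1.0 / (c * horizon + 1.0)

    # 3. Advantage normalization: estimate the advantage mean and variance
    #    over c times as many iterations.
    if "adv_norm_iters" in hp:
        new["adv_norm_iters"] = hp["adv_norm_iters"] * c

    # 4. PPG only: multiply the number of policy iterations per phase N by c.
    if "policy_iters_per_phase" in hp:
        new["policy_iters_per_phase"] = hp["policy_iters_per_phase"] * c

    return new


# Example (illustrative values only): halving the batch sizes, i.e. c = 2, with Adam.
base = {"learning_rate": 5e-4, "beta_prox": 0.9,
        "adv_norm_iters": 4, "policy_iters_per_phase": 32}
print(rescale_hyperparams(base, c=2))
```

The point of the sketch is that each rule is a deterministic function of c, so runs at different batch sizes can be compared without retuning; the actual hyperparameter values used in the paper are listed in its Appendix B.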