Batch size-invariance for policy optimization
Authors: Jacob Hilton, Karl Cobbe, John Schulman
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data. |
| Researcher Affiliation | Industry | Jacob Hilton, OpenAI (jhilton@openai.com); Karl Cobbe, OpenAI (karl@openai.com); John Schulman, OpenAI (joschu@openai.com) |
| Pseudocode | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma. |
| Open Source Code | Yes | Pseudocode for PPO-EWMA may be found in Appendix A, and code may be found at https://github.com/openai/ppo-ewma. |
| Open Datasets | Yes | To validate our analysis, we ran several experiments on Procgen Benchmark [Cobbe et al., 2019] |
| Dataset Splits | No | The paper uses environments from Procgen Benchmark for experiments, which are used to generate experience on the fly rather than relying on predefined static dataset splits for training, validation, and testing. Specific dataset split percentages or counts are not provided. |
| Hardware Specification | No | The type of resources used is proprietary information. The paper does not provide specific details regarding the hardware used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam' and 'PPG' but does not specify version numbers for any libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | Hyperparameters for all of our experiments can be found in Appendix B, and full results on each of the individual environments can be found in Appendix F. ... More specifically, to achieve batch size-invariance for PPO-EWMA and PPG-EWMA, we make the following adjustments to compensate for the optimization and iteration batch sizes being divided by some constant c: adjust the optimization hyperparameters as described in the previous section, i.e., divide the vanilla SGD learning rate by c or the Adam step size by √c (we use Adam); modify β_prox such that 1/(1 − β_prox) − 1 is multiplied by c; ... if using advantage normalization, multiply the number of iterations used to estimate the advantage mean and variance by c; ... for PPG, multiply the number of policy iterations per phase N by c. |
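
The adjustments quoted in the Experiment Setup row amount to a simple rescaling of hyperparameters when the batch sizes are divided by a constant c. The sketch below illustrates that rescaling in Python under stated assumptions; the function name, dictionary keys, and example values are hypothetical and are not taken from the paper or the openai/ppo-ewma code.

```python
import math


def rescale_hyperparams(hp: dict, c: float, optimizer: str = "adam") -> dict:
    """Hypothetical sketch of the batch size-invariance adjustments when the
    optimization and iteration batch sizes are divided by c."""
    new = dict(hp)

    # 1. Optimizer step size: divide the vanilla SGD learning rate by c,
    #    or the Adam step size by sqrt(c).
    if optimizer == "sgd":
        new["learning_rate"] = hp["learning_rate"] / c
    else:  # Adam
        new["learning_rate"] = hp["learning_rate"] / math.sqrt(c)

    # 2. Proximal policy EWMA decay: modify beta_prox so that
    #    1 / (1 - beta_prox) - 1 is multiplied by c.
    horizon = 1.0 / (1.0 - hp["beta_prox"]) - 1.0
    new["beta_prox"] = 1.0 - 1.0 / (c * horizon + 1.0)

    # 3. Advantage normalization: estimate the advantage mean and variance
    #    over c times as many iterations.
    if "adv_norm_iters" in hp:
        new["adv_norm_iters"] = hp["adv_norm_iters"] * c

    # 4. PPG only: multiply the number of policy iterations per phase N by c.
    if "policy_iters_per_phase" in hp:
        new["policy_iters_per_phase"] = hp["policy_iters_per_phase"] * c

    return new


# Example (illustrative values only): halving the batch sizes, i.e. c = 2, with Adam.
base = {"learning_rate": 5e-4, "beta_prox": 0.9,
        "adv_norm_iters": 4, "policy_iters_per_phase": 32}
print(rescale_hyperparams(base, c=2))
```

The point of the sketch is that each rule is a deterministic function of c, so runs at different batch sizes can be compared without retuning; the actual hyperparameter values used in the paper are listed in its Appendix B.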