POPGym: Benchmarking Partially Observable Reinforcement Learning

Authors: Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, Amanda Prorok

ICLR 2023

Each reproducibility variable is listed below with its result and the supporting LLM response quoted from the paper.

Research Type: Experimental
"Using POPGym, we execute the largest comparison across RL memory models to date. POPGym is available at https://github.com/proroklab/popgym."

Researcher Affiliation: Collaboration
"Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, Amanda Prorok. Steven Morad and Stephan Liwicki gratefully acknowledge the support of Toshiba Europe Ltd. Ryan Kortvelesy was supported by Nokia Bell Labs through their donation for the Centre of Mobile, Wearable Systems and Augmented Intelligence to the University of Cambridge."

Pseudocode: No
No explicit pseudocode or algorithm blocks are present in the paper.

Open Source Code: Yes
"POPGym is available at https://github.com/proroklab/popgym."

Open Datasets: Yes
"POPGym is a collection of 15 partially observable gym environments (Figure 1)... All environments come with at least three difficulty settings and randomly generate levels to prevent overfitting."
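
For context, a minimal sketch of interacting with one of these environments follows. The environment ID "popgym-RepeatPreviousEasy-v0" and the Gymnasium five-tuple step API are assumptions on my part; the repository README documents the exact registration names and API version.

```python
# A minimal sketch, assuming POPGym registers its environments with
# Gymnasium under IDs like "popgym-RepeatPreviousEasy-v0" (check the
# README for the exact names and difficulty suffixes).
import gymnasium as gym
import popgym  # importing registers the POPGym environments

env = gym.make("popgym-RepeatPreviousEasy-v0")
obs, info = env.reset(seed=0)  # levels are randomly generated per episode
done = False
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```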

Dataset Splits: No
"We run three trials of each model over three difficulties for each environment, resulting in over 1700 trials."

Hardware Specification: Yes
"Table 1: Frames per second (FPS) of our environments, computed on the Google Colab free tier and a Macbook Air (2020) laptop. We compute CPU statistics on a 3GHz Xeon Gold and GPU statistics on a 2080Ti, reporting the mean and 95% confidence interval over 10 trials."
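
The quoted methodology (10 trials, mean and 95% confidence interval) can be approximated with a simple timing loop like the sketch below. This is my reconstruction of the measurement, not the authors' benchmarking script, and it assumes the Gymnasium five-tuple step API.

```python
# A sketch of the kind of FPS measurement reported in Table 1: time
# random rollouts over several trials and report the mean and a
# normal-approximation 95% confidence interval.
import time
import numpy as np

def measure_fps(env, steps_per_trial=10_000, trials=10):
    fps = []
    for _ in range(trials):
        env.reset()
        start = time.perf_counter()
        for _ in range(steps_per_trial):
            _, _, terminated, truncated, _ = env.step(env.action_space.sample())
            if terminated or truncated:
                env.reset()
        fps.append(steps_per_trial / (time.perf_counter() - start))
    fps = np.asarray(fps)
    ci = 1.96 * fps.std(ddof=1) / np.sqrt(trials)  # 95% CI half-width
    return fps.mean(), ci
```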

Software Dependencies: No
"We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. We rewrite models where the existing implementation is slow, unreadable, not amenable to our API, or not written in Pytorch."
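
Although the quote does not spell out the memory API, building baselines "on top of RLlib" typically means the custom-model pattern sketched below. The class name, layer sizes, and registration key are illustrative assumptions, not POPGym's actual memory API.

```python
# A sketch of RLlib's standard custom recurrent-model pattern, the kind
# of integration point POPGym's baselines build on. GRUBaseline and
# "gru_baseline" are hypothetical names for illustration.
import torch
import torch.nn as nn
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork

class GRUBaseline(RecurrentNetwork, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        nn.Module.__init__(self)
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        self.hidden = 128
        self.encode = nn.Linear(obs_space.shape[0], self.hidden)
        self.gru = nn.GRU(self.hidden, self.hidden, batch_first=True)
        self.pi = nn.Linear(self.hidden, num_outputs)
        self.vf = nn.Linear(self.hidden, 1)
        self._features = None

    def get_initial_state(self):
        # One recurrent state tensor per episode; RLlib adds the batch dim.
        return [torch.zeros(self.hidden)]

    def forward_rnn(self, inputs, state, seq_lens):
        # inputs: [batch, time, obs_dim]; state: list of [batch, hidden].
        x = torch.relu(self.encode(inputs))
        h_in = state[0].unsqueeze(0)  # [1, batch, hidden] for nn.GRU
        self._features, h_out = self.gru(x, h_in)
        return self.pi(self._features), [h_out.squeeze(0)]

    def value_function(self):
        # Flatten [batch, time, 1] -> [batch * time] as RLlib expects.
        return self.vf(self._features).reshape(-1)

ModelCatalog.register_custom_model("gru_baseline", GRUBaseline)
```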

Experiment Setup: Yes
"We present the full experimental parameters in Appendix A and detailed results for each environment and model in Appendix B." Table 2 lists the PPO hyperparameters used in all experiments:

- Decay factor γ: 0.99
- Value fn. loss coef.: 1.0
- Entropy loss coef.: 0.0
- Learning rate: 5e-5
- Num. SGD iters: 30
- Batch size: 65536
- Minibatch size: 8192
- GAE λ: 1.0
- KL target: 0.01
- KL coefficient: 0.2
- PPO clipping: 0.3
- Value clipping: 0.3
- BPTT truncation length: maximum episode length (1024)
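
For readers reproducing the setup, the quoted values map naturally onto RLlib's classic PPO config dict. The key names below are RLlib's own; interpreting the BPTT truncation length as model.max_seq_len, the use of RLlib's built-in LSTM wrapper, and the environment ID are my assumptions, not details confirmed by the paper.

```python
# A sketch mapping the Table 2 hyperparameters onto RLlib's PPO config.
from ray import tune

config = {
    "env": "popgym-RepeatPreviousEasy-v0",  # assumed registration ID
    "framework": "torch",
    "gamma": 0.99,               # decay factor
    "lr": 5e-5,                  # learning rate
    "num_sgd_iter": 30,          # num. SGD iters
    "train_batch_size": 65536,   # batch size
    "sgd_minibatch_size": 8192,  # minibatch size
    "lambda": 1.0,               # GAE lambda
    "kl_target": 0.01,
    "kl_coeff": 0.2,
    "clip_param": 0.3,           # PPO clipping
    "vf_clip_param": 0.3,        # value clipping
    "vf_loss_coeff": 1.0,        # value fn. loss coef.
    "entropy_coeff": 0.0,        # entropy loss coef.
    "model": {
        # POPGym's baselines plug in custom memory models instead;
        # RLlib's built-in LSTM keeps this sketch self-contained.
        "use_lstm": True,
        "max_seq_len": 1024,     # BPTT truncation length
    },
}
tune.run("PPO", config=config)
```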