POPGym: Benchmarking Partially Observable Reinforcement Learning
Authors: Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, Amanda Prorok
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using POPGym, we execute the largest comparison across RL memory models to date. POPGym is available at https://github.com/proroklab/popgym. |
| Researcher Affiliation | Collaboration | Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, Amanda Prorok. Steven Morad and Stephan Liwicki gratefully acknowledge the support of Toshiba Europe Ltd. Ryan Kortvelesy was supported by Nokia Bell Labs through their donation for the Centre of Mobile, Wearable Systems and Augmented Intelligence to the University of Cambridge. |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | POPGym is available at https://github.com/proroklab/popgym. |
| Open Datasets | Yes | POPGym is a collection of 15 partially observable gym environments (Figure 1)... All environments come with at least three difficulty settings and randomly generate levels to prevent overfitting. A minimal environment-usage sketch follows this table. |
| Dataset Splits | No | We run three trials of each model over three difficulties for each environment, resulting in over 1700 trials. |
| Hardware Specification | Yes | Table 1: Frames per second (FPS) of our environments, computed on the Google Colab free tier and a MacBook Air (2020) laptop. We compute CPU statistics on a 3GHz Xeon Gold and GPU statistics on a 2080Ti, reporting the mean and 95% confidence interval over 10 trials. |
| Software Dependencies | No | We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. We rewrite models where the existing implementation is slow, unreadable, not amenable to our API, or not written in PyTorch. The frameworks are named, but no version numbers are pinned. A sketch of an RLlib-style memory model follows this table. |
| Experiment Setup | Yes | We present the full experimental parameters in Appendix A and detailed results for each environment and model in Appendix B. Table 2 (PPO hyperparameters used in all experiments): decay factor γ = 0.99; value fn. loss coef. = 1.0; entropy loss coef. = 0.0; learning rate = 5e-5; num. SGD iters = 30; batch size = 65536; minibatch size = 8192; GAE λ = 1.0; KL target = 0.01; KL coefficient = 0.2; PPO clipping = 0.3; value clipping = 0.3; BPTT truncation length = maximum episode length (1024). An RLlib-style config sketch of these settings follows this table. |
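
To make the open-environment claim concrete, here is a minimal usage sketch. It is not taken from the paper: the module path `popgym.envs.repeat_previous`, the `RepeatPreviousEasy` class name, and the gymnasium-style five-tuple `step` return are assumptions about the installed popgym version and may differ from the release you use.

```python
# A minimal usage sketch (assumed API, not code from the paper): run a
# random policy in one POPGym environment for a single episode.
from popgym.envs.repeat_previous import RepeatPreviousEasy  # assumed path

env = RepeatPreviousEasy()  # Medium/Hard variants expose harder settings
obs, info = env.reset(seed=0)  # levels are randomly generated each episode

done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()  # memory-free random baseline
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    done = terminated or truncated
print(f"random-policy return: {episode_return:.3f}")
```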
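The Software Dependencies row notes that the memory baselines sit on top of RLlib. Below is a hedged sketch of the kind of recurrent model such an API wraps, using RLlib's `RecurrentNetwork` base class with a GRU core; the class name `GRUMemoryModel`, the hidden size, and the layer layout are illustrative, not POPGym's baseline code.

```python
# Illustrative RLlib recurrent model (not POPGym's implementation):
# encoder -> GRU -> separate policy and value heads.
import torch
import torch.nn as nn
from ray.rllib.models.torch.recurrent_net import RecurrentNetwork


class GRUMemoryModel(RecurrentNetwork, nn.Module):
    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        nn.Module.__init__(self)
        super().__init__(obs_space, action_space, num_outputs, model_config, name)
        self.hidden = 256  # illustrative size, not a setting from the paper
        self.encoder = nn.Linear(obs_space.shape[0], self.hidden)
        self.core = nn.GRU(self.hidden, self.hidden, batch_first=True)
        self.pi = nn.Linear(self.hidden, num_outputs)
        self.vf = nn.Linear(self.hidden, 1)
        self._features = None

    def get_initial_state(self):
        # One zero vector per GRU layer; RLlib threads this state between steps.
        return [torch.zeros(self.hidden)]

    def forward_rnn(self, inputs, state, seq_lens):
        # inputs: [batch, time, obs_dim]; state[0]: [batch, hidden]
        x = torch.relu(self.encoder(inputs))
        self._features, h = self.core(x, state[0].unsqueeze(0))
        return self.pi(self._features), [h.squeeze(0)]

    def value_function(self):
        # RLlib expects a flat [batch * time] value tensor.
        return torch.reshape(self.vf(self._features), [-1])
```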
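The Table 2 hyperparameters map directly onto RLlib's classic PPO configuration keys. The sketch below is a reading of the reported values using RLlib's dictionary-style config schema (e.g. `sgd_minibatch_size`, and `model.max_seq_len` for the BPTT truncation length); it is not a config file from the paper.

```python
# Table 2's PPO hyperparameters expressed with RLlib's classic config keys.
ppo_config = {
    "gamma": 0.99,               # decay factor
    "vf_loss_coeff": 1.0,        # value fn. loss coefficient
    "entropy_coeff": 0.0,        # entropy loss coefficient
    "lr": 5e-5,                  # learning rate
    "num_sgd_iter": 30,          # SGD iterations per training batch
    "train_batch_size": 65536,   # batch size
    "sgd_minibatch_size": 8192,  # minibatch size
    "lambda": 1.0,               # GAE lambda
    "kl_target": 0.01,
    "kl_coeff": 0.2,
    "clip_param": 0.3,           # PPO clipping
    "vf_clip_param": 0.3,        # value clipping
    "model": {
        # BPTT truncation length = maximum episode length (1024)
        "max_seq_len": 1024,
    },
}
```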