Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability
Authors: Dibya Ghosh, Jad Rahme, Aviral Kumar, Amy Zhang, Ryan P. Adams, Sergey Levine
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that our simple algorithm derived from the epistemic POMDP achieves significant gains in generalization over current methods on the Procgen benchmark suite. |
| Researcher Affiliation | Collaboration | 1 UC Berkeley, 2 Princeton University, 3 Facebook AI Research. |
| Pseudocode | Yes | Algorithm 1 Linked Ensembles for the Epistemic POMDP (LEEP) |
| Open Source Code | No | The paper cites a third-party implementation of RL algorithms [58] (PyTorch implementations of reinforcement learning algorithms), which the authors used, but it does not state that their own code for LEEP or their experiments is open source, nor does it provide a link to it. |
| Open Datasets | Yes | The Procgen benchmark is a set of procedurally generated games, each with different generalization challenges. In each game, during training, the algorithm can interact with 200 training levels, before it is asked to generalize to the full distribution of levels. (cited [16]: K. Cobbe, C. Hesse, J. Hilton, and J. Schulman. Leveraging procedural generation to benchmark reinforcement learning. arXiv, abs/1912.01588, 2020.) |
| Dataset Splits | No | The paper mentions 'training levels' and 'full distribution of levels' (implying a test set), but does not explicitly define a separate validation split or discuss how data was partitioned for validation purposes. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using PPO [54] and refers to PyTorch implementations [58], but it does not specify version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We instantiate our method using an ensemble of n = 4 policies, a penalty parameter of α = 1, and PPO [54] to train the individual policies (full implementation details in Appendix C). A hedged sketch of this setup appears below the table. |
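Since the rows above only quote the high-level setup (an ensemble of n = 4 policies trained with PPO, linked by a penalty with weight α = 1, per Algorithm 1, LEEP) and no official code is linked, the following is a minimal, hypothetical sketch of how such a linking penalty could be computed. The policy network, the KL form of the disagreement penalty, and all names here are illustrative assumptions, not the authors' implementation; the per-policy PPO loss on each level subset is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not the authors' code). Assumptions: a discrete action
# space, a small MLP policy per ensemble member, and a KL-based disagreement
# penalty toward the ensemble mean; the exact penalty used by LEEP may differ.

class PolicyNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Return per-action log-probabilities.
        return F.log_softmax(self.net(obs), dim=-1)


def linking_penalty(policies, obs, alpha: float = 1.0) -> torch.Tensor:
    """Penalize disagreement between each ensemble member and the ensemble mean.

    Each of the n policies would additionally be trained with PPO on its own
    subset of training levels; this function only sketches the linking term
    with weight alpha.
    """
    log_probs = [p(obs) for p in policies]                      # n x [batch, n_actions]
    mean_probs = torch.stack([lp.exp() for lp in log_probs]).mean(dim=0)
    # F.kl_div(input=log q, target=p) computes KL(p || q); here p is the
    # detached ensemble mean and q is member i, so gradients only flow into
    # the individual member being penalized.
    penalty = sum(
        F.kl_div(lp, mean_probs.detach(), reduction="batchmean")
        for lp in log_probs
    )
    return alpha * penalty


if __name__ == "__main__":
    torch.manual_seed(0)
    obs_dim, n_actions, n_policies = 8, 4, 4      # n = 4 policies, as in the paper's setup
    policies = [PolicyNet(obs_dim, n_actions) for _ in range(n_policies)]
    obs = torch.randn(32, obs_dim)                # a batch of (synthetic) observations
    loss = linking_penalty(policies, obs, alpha=1.0)   # α = 1, as quoted above
    loss.backward()
    print(f"linking penalty: {loss.item():.4f}")
```

In a full reproduction, this penalty would be added to each member's PPO objective on its disjoint subset of the 200 Procgen training levels, and the ensemble would be aggregated into a single policy at evaluation time.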