When Is Generalizable Reinforcement Learning Tractable?
Authors: Dhruv Malik, Yuanzhi Li, Pradeep Ravikumar
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | From the paper: 'Our Contributions. We introduce Weak Proximity, a natural structural condition that is motivated by classical RL results, and requires the environments to have highly similar transition and reward functions and share optimal trajectories. We prove a statistical lower bound demonstrating that tractable generalization is impossible, despite this shared structure. This lower bound holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Consequentially, we show that a classical metric for measuring the relative closeness of MDPs is not the right metric for modern RL generalization settings. Our lower bound implies that learning a state representation for the purpose of efficiently generalizing to multiple environments is worst case sample inefficient, even when such a representation exists, the environments are ostensibly similar, and any single environment can be efficiently solved. To provide a sufficient condition for efficient generalization, we introduce Strong Proximity. This structural condition strengthens Weak Proximity by additionally constraining the environments to share an optimal policy. We provide an algorithm which exploits Strong Proximity to provably and efficiently generalize, when the environments share deterministic transitions.' From the checklist: '3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] No experiments were run.' |
| Researcher Affiliation | Academia | Dhruv Malik, Yuanzhi Li, and Pradeep Ravikumar: Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213 |
| Pseudocode | Yes | Algorithm 1 Inputs: horizon length H, distribution D, sample size n, oracle V̂ as defined in WIO |
| Open Source Code | No | The paper states under '3. If you ran experiments...' and '4. If you are using existing assets...' that 'No experiments were run' and 'No such assets were used or created', which implies no custom code for the methodology was released. |
| Open Datasets | No | The paper states 'No experiments were run', indicating no dataset was used for training. |
| Dataset Splits | No | The paper states 'No experiments were run', indicating no dataset splits were defined. |
| Hardware Specification | No | The paper explicitly states 'No experiments were run', meaning no hardware was used for experiments and thus no specifications are provided. |
| Software Dependencies | No | The paper explicitly states 'No experiments were run', meaning no software dependencies for experiments are relevant or provided. |
| Experiment Setup | No | The paper explicitly states 'No experiments were run', so no experimental setup details like hyperparameters are provided. |
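For context on the Pseudocode row above, the quoted Algorithm 1 inputs (horizon H, distribution D, sample size n, and a value oracle V̂) suggest a generic sampled-rollout interface. The sketch below is purely illustrative: the environment class, function name, and greedy-rollout body are assumptions, not the paper's actual procedure, which the paper specifies only in pseudocode.

```python
class ChainEnv:
    """Toy deterministic chain environment: states 0..H, actions stay (0) or advance (1)."""

    def reset(self):
        return 0

    def actions(self, state):
        return [0, 1]

    def step(self, state, action):
        return state + action


def greedy_rollouts(H, sample_env, n, v_oracle):
    """Draw n environments from the distribution D (via sample_env) and,
    in each, roll out H steps acting greedily w.r.t. the value oracle V-hat."""
    trajectories = []
    for _ in range(n):
        env = sample_env()          # one draw from the distribution D
        state = env.reset()
        traj = [state]
        for _ in range(H):
            # pick the action whose successor state the oracle values most highly
            action = max(env.actions(state),
                         key=lambda a: v_oracle(env.step(state, a)))
            state = env.step(state, action)
            traj.append(state)
        trajectories.append(traj)
    return trajectories
```

With an identity oracle on the toy chain, `greedy_rollouts(3, ChainEnv, 2, lambda s: s)` always advances, producing two trajectories `[0, 1, 2, 3]`. This only demonstrates how the four quoted inputs could compose, not the paper's guarantee under Strong Proximity.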