On Oracle-Efficient PAC RL with Rich Observations
Authors: Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation accessing policy and value function classes exclusively through standard optimization primitives and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings. |
| Researcher Affiliation | Collaboration | Christoph Dann, Carnegie Mellon University, Pittsburgh, Pennsylvania (cdann@cdann.net); Nan Jiang, UIUC, Urbana, Illinois (nanjiang@illinois.edu); Akshay Krishnamurthy, Microsoft Research, New York, New York (akshay@cs.umass.edu); Alekh Agarwal, Microsoft Research, Redmond, Washington (alekha@microsoft.com); John Langford, Microsoft Research, New York, New York (jcl@microsoft.com); Robert E. Schapire, Microsoft Research, New York, New York (schapire@microsoft.com) |
| Pseudocode | Yes | The paper includes pseudocode: Algorithm 1 (Main algorithm: VALOR), Algorithm 2 (Subroutine: Policy optimization with local values), and Algorithm 3 (Subroutine: DFS learning of local values). |
| Open Source Code | No | The paper provides no statement or link indicating open-source code availability for the described methodology. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and proofs; it does not describe empirical experiments performed on a specific public dataset or provide access information for one. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset usage, so no training/validation/test splits are provided. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for computations or experiments, consistent with its theoretical nature. |
| Software Dependencies | No | The paper mentions general optimization approaches such as 'standard continuous optimization methods' and 'The ellipsoid method [37]', but it does not specify any particular software, libraries, or version numbers used to implement or execute them. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design; it does not describe empirical experimental setups, hyperparameter values, or training configurations. |