On Oracle-Efficient PAC RL with Rich Observations

Authors: Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation, accessing policy and value function classes exclusively through standard optimization primitives, and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.
Researcher Affiliation | Collaboration | Christoph Dann, Carnegie Mellon University, Pittsburgh, Pennsylvania, cdann@cdann.net; Nan Jiang, UIUC, Urbana, Illinois, nanjiang@illinois.edu; Akshay Krishnamurthy, Microsoft Research, New York, New York, akshay@cs.umass.edu; Alekh Agarwal, Microsoft Research, Redmond, Washington, alekha@microsoft.com; John Langford, Microsoft Research, New York, New York, jcl@microsoft.com; Robert E. Schapire, Microsoft Research, New York, New York, schapire@microsoft.com
Pseudocode | Yes | Algorithm 1: Main Algorithm VALOR; Algorithm 2: Subroutine: Policy optimization with local values; Algorithm 3: Subroutine: DFS Learning of local values
Open Source Code | No | The paper does not provide any explicit statements or links indicating open-source code availability for the described methodology.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and proofs; it does not describe empirical experiments on a specific public dataset or provide access information for one.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset usage, so no training/validation/test splits are provided.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used for computations or experiments, consistent with its theoretical nature.
Software Dependencies | No | The paper mentions general optimization methods such as "standard continuous optimization methods" and "the ellipsoid method [37]", but it does not name any particular software, libraries, or version numbers used to implement or execute them.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design; it does not describe empirical experimental setups, hyperparameter values, or training configurations.