Contextual Decision Processes with Low Bellman Rank are PAC-Learnable

Authors: Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, Robert E. Schapire

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper studies systematic exploration for reinforcement learning (RL) with rich observations and function approximation. We introduce contextual decision processes (CDPs) that unify most prior RL settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in CDPs and is naturally small for many well-studied RL models. Our second contribution is a new RL algorithm that performs systematic exploration to learn near-optimal behavior in CDPs with low Bellman rank. The algorithm requires a number of samples that is polynomial in all relevant parameters but independent of the number of unique contexts. Our approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for RL with function approximation. We prove that OLIVE performs sample-efficient learning in CDPs with a small Bellman rank (Section 4.2).
Researcher Affiliation | Collaboration | ¹University of Michigan, Ann Arbor; ²University of Massachusetts, Amherst; ³Microsoft Research, New York.
Pseudocode | Yes | Pseudocode for our algorithm, OLIVE (Optimism Led Iterative Value-function Elimination), is displayed in Algorithm 1.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code or a link to a code repository for the described methodology.
Open Datasets | No | The paper is theoretical and does not involve experimental evaluation on a dataset, so no information about public dataset access is provided.
Dataset Splits | No | The paper is theoretical and does not involve experimental evaluation on a dataset, so it provides no details on training/validation/test splits.
Hardware Specification | No | The paper focuses on algorithm design and theoretical guarantees and does not mention any hardware used to run experiments.
Software Dependencies | No | The paper does not specify any software dependencies or version numbers for implementation or experimentation.
Experiment Setup | No | The paper does not provide practical experimental setup details, hyperparameters, or training configurations. It defines parameters for its algorithm (n_est, n_eval, n, φ) derived from theoretical analysis, but these are not empirical settings.
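The OLIVE loop described in the report (optimistic candidate selection, a Bellman-consistency test, elimination of inconsistent candidates) can be sketched in a heavily simplified form. The toy horizon-1 CDP below, the candidate class `F`, and the tolerance `eps` are all illustrative assumptions, not the paper's construction; with horizon 1, a candidate's Bellman error reduces to its average over-prediction of the realized reward under its own greedy policy.

```python
import random

random.seed(0)

# Hypothetical one-step CDP: two contexts, two actions, Bernoulli rewards
# with these means. This is an illustrative toy, not the paper's setting.
TRUE_R = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.1}

def sample_context():
    return random.randint(0, 1)

def sample_reward(x, a):
    return 1.0 if random.random() < TRUE_R[(x, a)] else 0.0

def greedy(f, x):
    # Greedy action of candidate value function f in context x.
    return max((0, 1), key=lambda a: f[(x, a)])

def predicted_value(f, n=2000):
    # Monte Carlo estimate of the value f claims its greedy policy achieves.
    return sum(f[(x, greedy(f, x))]
               for x in (sample_context() for _ in range(n))) / n

def bellman_error(f, n=2000):
    # At horizon 1 the Bellman error is the average gap between f's
    # prediction and the realized reward under f's greedy policy.
    total = 0.0
    for _ in range(n):
        x = sample_context()
        a = greedy(f, x)
        total += f[(x, a)] - sample_reward(x, a)
    return total / n

def olive(candidates, eps=0.05):
    survivors = list(candidates)
    while survivors:
        f = max(survivors, key=predicted_value)   # optimistic choice
        if abs(bellman_error(f)) <= eps:          # Bellman-consistent: done
            return f
        # Eliminate every candidate whose estimated Bellman error is large.
        survivors = [g for g in survivors if abs(bellman_error(g)) <= eps]
    return None

# Candidate class: one realizable function plus two over-optimistic guesses.
F = [
    dict(TRUE_R),
    {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.9},
    {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95},
]

learned = olive(F)
print(greedy(learned, 0), greedy(learned, 1))  # optimal actions per context
```

In this run the optimistic but inconsistent candidates incur a large estimated Bellman error on the first iteration and are eliminated, leaving the realizable candidate, whose greedy policy is optimal. The full algorithm additionally rolls in with the optimistic policy and tests Bellman consistency at each level of the horizon, which this degenerate horizon-1 toy cannot exhibit.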