Model-Free Reinforcement Learning with the Decision-Estimation Coefficient

Authors: Dylan J. Foster, Noah Golowich, Jian Qian, Alexander Rakhlin, Ayush Sekhari

NeurIPS 2023 | Conference PDF | arXiv PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. [12] introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction that lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang [31], it is possible to obtain guarantees that improve upon those of Foster et al. [12] by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally. (A sketch of the Decision-Estimation Coefficient appears after this table.)
Researcher Affiliation | Collaboration | Dylan J. Foster (dylanfoster@microsoft.com), Noah Golowich (nzg@mit.edu), Jian Qian (jianqian@mit.edu), Alexander Rakhlin (rakhlin@mit.edu), Ayush Sekhari (sekhari@mit.edu)
Pseudocode | Yes | Algorithm 1: Estimation-to-Decisions (E2D) for General Divergences; Algorithm 2: Optimistic Estimation-to-Decisions (E2D.Opt); Algorithm 3: Optimistic Estimation-to-Decisions (E2D.Opt) with Batching; Algorithm 4: Two-Timescale Exponential Weights for Bellman Complete Value Function Classes; Algorithm 5: Two-Timescale Exponential Weights (adapted from Agarwal and Zhang [2]); Algorithm 6: Optimistic Estimation for Bilinear Classes. (A simplified sketch of the basic E2D loop appears after this table.)
Open Source Code | No | The paper does not provide any links to open-source code or explicitly state that code for the described methodology is released.
Open Datasets | No | This paper is theoretical, focusing on deriving regret bounds and structural results for reinforcement learning. It does not conduct empirical studies on specific datasets and therefore does not mention training datasets or their availability.
Dataset Splits | No | This paper is theoretical and does not describe empirical experiments with data splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for computations or experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers for its theoretical derivations or algorithms.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and proofs of regret bounds. It does not describe an empirical experimental setup with hyperparameters or system-level training settings.
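
For reference, here is a sketch of the Decision-Estimation Coefficient discussed in the abstract above, in the squared-Hellinger form of Foster et al. [12]; the present paper works with more general divergences, and the notation here may differ slightly from the paper's own.

```latex
% Decision-Estimation Coefficient (Hellinger form), as in Foster et al. [12].
% \Pi: decision space; \mathcal{M}: model class; \widehat{M}: reference estimate;
% f^{M}(\pi): expected payoff of decision \pi under model M;
% \pi_M: optimal decision for M; D^2_{\mathsf{H}}: squared Hellinger distance.
\mathsf{dec}_{\gamma}(\mathcal{M}, \widehat{M})
  = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
    \mathbb{E}_{\pi \sim p}\!\left[
      f^{M}(\pi_M) - f^{M}(\pi)
      - \gamma \, D^2_{\mathsf{H}}\!\big(M(\pi), \widehat{M}(\pi)\big)
    \right]
```

Intuitively, larger γ more heavily penalizes models that are statistically distinguishable from the estimate M̂, which is what ties the achievable regret to the quality of online estimation.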
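
Since the method is specified only as pseudocode (Algorithms 1-6 above), the following is a minimal, hypothetical Python sketch of the basic E2D loop; `estimation_oracle`, `solve_dec`, and `environment` are placeholder interfaces assumed for illustration, not the paper's implementation.

```python
import random

# Hypothetical sketch of the Estimation-to-Decisions (E2D) meta-algorithm
# (cf. Algorithm 1). All interfaces here are illustrative assumptions.

def e2d(T, decisions, estimation_oracle, solve_dec, environment):
    """Run E2D for T rounds and return the interaction history."""
    history = []
    for _ in range(T):
        # Online estimation: fit a model estimate to the history so far.
        m_hat = estimation_oracle(history)
        # Decision step: compute a distribution over decisions that
        # (approximately) minimizes the DEC objective at m_hat.
        weights = solve_dec(m_hat, decisions)
        # Sample a decision from that distribution and play it.
        pi = random.choices(decisions, weights=weights, k=1)[0]
        reward, obs = environment(pi)
        history.append((pi, reward, obs))
    return history
```

The optimistic variants (Algorithms 2-3) appear to follow the same loop while training the estimation oracle with an optimistically biased loss, with the two-timescale exponential-weights procedures (Algorithms 4-5) serving as estimation subroutines.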