Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable, as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors. In this paper, we have developed and analyzed an actor-critic procedure designed for finding near-optimal policies in the offline setting.
Researcher Affiliation | Academia | Andrea Zanette, University of California, Berkeley; Martin J. Wainwright, University of California, Berkeley; Emma Brunskill, Stanford University
Pseudocode | Yes | Algorithm 1: ACTOR (MIRROR DESCENT) ... Algorithm 2: CRITIC (PLSPE)
Open Source Code | No | Future updates of this work will be available at https://arxiv.org/abs/2108.08812
Open Datasets | No | The paper is theoretical and models data generation (Assumption 1), but it does not specify or use any publicly available datasets for empirical evaluation.
Dataset Splits | No | The paper is theoretical and includes no empirical experiments, so no training/validation/test dataset splits are provided.
Hardware Specification | No | The paper is theoretical and describes no experimental setup, so no hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, such as hyperparameters or training configurations.
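For intuition about the actor side of the algorithm: the paper's Algorithm 1 runs mirror descent on the policy, which for tabular softmax policies reduces to an exponentiated-gradient update driven by the critic's (pessimistic) action-value estimates. The sketch below is not the paper's exact procedure; it is a minimal illustration of that generic update, with the function name, array shapes, and toy numbers all chosen for the example.

```python
import numpy as np

def mirror_descent_actor_update(policy, q_values, step_size):
    """One exponentiated-gradient (mirror descent) policy update.

    policy:    (num_states, num_actions) array of action probabilities
    q_values:  (num_states, num_actions) critic estimates; in the paper's
               setting these would be pessimistic values from the critic
    step_size: mirror descent learning rate (eta)
    """
    # Multiplicative-weights form: pi_{t+1}(a|s) ∝ pi_t(a|s) * exp(eta * Q_t(s, a))
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)  # renormalize per state
    return new_policy

# Toy usage: 2 states, 3 actions, starting from the uniform policy.
policy = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
new_policy = mirror_descent_actor_update(policy, q, step_size=1.0)
```

Each update shifts probability mass toward actions with higher estimated value while staying close (in KL divergence) to the previous policy, which is what yields the incremental, stable policy sequence the analysis relies on.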