Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill

NeurIPS 2021

Reproducibility assessment (each variable is listed with its result and the supporting LLM response):

Research Type: Theoretical
LLM Response: We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors. In this paper, we have developed and analyzed an actor-critic method, designed for finding near-optimal policies in the offline setting.
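For orientation, the pessimism principle named in this response can be written schematically; this is a generic paraphrase of pessimistic offline policy evaluation, not the paper's exact program. The critic reports the smallest value estimate among all action-value functions that remain consistent with the offline data:

\[
\widehat{J}(\pi) \;=\; \min_{Q \in \mathcal{C}(\pi)} \; \mathbb{E}_{s \sim \rho}\bigl[\, Q(s, \pi(s)) \,\bigr],
\]

where \(\mathcal{C}(\pi)\) is a confidence set of candidate \(Q\)-functions that approximately satisfy the Bellman evaluation equations on the dataset, and \(\rho\) is the initial-state distribution. Deliberately underestimating poorly covered policies is what allows the suboptimality guarantee to depend only on the data coverage of the comparator policy.
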
Researcher Affiliation: Academia
LLM Response: Andrea Zanette, University of California, Berkeley (zanette@berkeley.edu); Martin J. Wainwright, University of California, Berkeley (wainwrig@berkeley.edu); Emma Brunskill, Stanford University (ebrun@stanford.edu)

Pseudocode: Yes
LLM Response: Algorithm 1 ACTOR (MIRROR DESCENT) ... Algorithm 2 CRITIC (PLSPE)
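To make the two pseudocode components concrete, here is a minimal runnable sketch of the same structure: a critic that performs least-squares policy evaluation with a subtracted elliptical uncertainty bonus (a simplified stand-in for the paper's PLSPE second-order program), and an actor that performs the multiplicative-weights mirror descent update. The function names, the linear feature model, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pessimistic_lspe(phi, phi_next_pi, rewards, gamma=0.99, reg=1.0, beta=1.0):
    """Sketch of a pessimistic least-squares critic (not the paper's exact
    PLSPE program).

    phi         : (n, d) features phi(s_i, a_i) of the logged state-actions
    phi_next_pi : (n, d) expected next features E_{a'~pi}[phi(s'_i, a')]
    rewards     : (n,)  logged rewards
    Returns a function mapping a feature vector to a pessimistic Q estimate.
    """
    n, d = phi.shape
    Sigma = phi.T @ phi + reg * np.eye(d)        # regularized covariance
    Sigma_inv = np.linalg.inv(Sigma)
    # LSTD-style solution of the projected Bellman evaluation equation:
    # (Phi^T (Phi - gamma * Phi'_pi) + reg * I) w = Phi^T r.
    A = phi.T @ (phi - gamma * phi_next_pi) + reg * np.eye(d)
    w = np.linalg.solve(A, phi.T @ rewards)

    def q_pessimistic(phi_sa):
        # Elliptical uncertainty width; subtracting it (rather than adding
        # it, as optimistic online methods do) implements pessimism.
        bonus = beta * np.sqrt(phi_sa @ Sigma_inv @ phi_sa)
        return phi_sa @ w - bonus
    return q_pessimistic

def mirror_descent_step(logits, q_row, eta=0.1):
    """Actor update pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(eta * Q_t(s, a)),
    applied additively in softmax-logit space."""
    return logits + eta * q_row

# Toy usage on random data: 200 logged transitions, 5-dimensional features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 5))
phi_next_pi = rng.normal(size=(200, 5))
rewards = rng.normal(size=200)
q_hat = pessimistic_lspe(phi, phi_next_pi, rewards)
print(q_hat(rng.normal(size=5)))   # pessimistic value of one state-action pair
```

In the paper itself, pessimism enters through a sequence of second-order programs over the critic's parameters rather than a subtracted bonus; the sketch above is only meant to convey the direction of the adjustment and the actor's multiplicative-weights update.
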
Open Source Code: No
LLM Response: Future updates of this work will be available at https://arxiv.org/abs/2108.08812

Open Datasets: No
LLM Response: The paper is theoretical and models data generation (Assumption 1), but it does not specify or use any publicly available datasets for empirical evaluation.

Dataset Splits: No
LLM Response: The paper is theoretical and does not include empirical experiments, so no training/validation/test dataset splits are provided.

Hardware Specification: No
LLM Response: The paper is theoretical and does not describe any experimental setup, so no hardware specifications are provided.

Software Dependencies: No
LLM Response: The paper is theoretical and does not specify any software dependencies with version numbers.

Experiment Setup: No
LLM Response: The paper is theoretical and does not describe any empirical experimental setup, such as specific hyperparameters or training configurations.