Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable, as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors. In this paper, we have developed and analyzed an actor-critic procedure designed for finding near-optimal policies in the offline setting. |
| Researcher Affiliation | Academia | Andrea Zanette University of California, Berkeley zanette@berkeley.edu Martin J. Wainwright University of California, Berkeley wainwrig@berkeley.edu Emma Brunskill Stanford University ebrun@stanford.edu |
| Pseudocode | Yes | Algorithm 1 ACTOR (MIRROR DESCENT) ... Algorithm 2 CRITIC (PLSPE) |
| Open Source Code | No | Future updates of this work will be available at https://arxiv.org/abs/2108.08812 |
| Open Datasets | No | The paper is theoretical and models data generation abstractly (Assumption 1); it does not specify or use any publicly available datasets for empirical evaluation. |
| Dataset Splits | No | The paper is theoretical and includes no empirical experiments, so no training/validation/test dataset splits are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup, thus no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, such as hyperparameters or training configurations. |
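For context on the pseudocode noted above: the paper's Algorithm 1 is a mirror-descent actor, whose standard tabular form sets the policy at iteration T proportional to the exponential of the accumulated critic estimates, pi_T(a|s) ∝ exp(eta * sum_t Q_t(s, a)). The sketch below is an illustrative NumPy implementation of this generic update, not the paper's full method (which pairs it with the pessimistic critic of Algorithm 2); the function name and the uniform-initialization choice are assumptions.

```python
import numpy as np

def mirror_descent_actor(q_estimates, eta=0.1):
    """Illustrative mirror-descent (exponentiated-gradient) actor update.

    q_estimates: sequence of critic outputs Q_t, each an array of shape
        (n_states, n_actions), one per actor iteration.
    eta: learning rate of the mirror-descent step (assumed hyperparameter).

    Starting from a uniform policy, each step multiplies the policy by
    exp(eta * Q_t) and renormalizes, so the final policy satisfies
    pi_T(a|s) proportional to exp(eta * sum_t Q_t(s, a)).
    """
    cumulative = np.sum(np.asarray(q_estimates), axis=0)
    logits = eta * cumulative
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```

In the paper's actual procedure the Q_t inputs would come from the pessimistic least-squares policy-evaluation critic (PLSPE), which is what yields the coverage-dependent suboptimality guarantee; with exact Q-values this update reduces to standard softmax policy iteration.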