Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable, as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors. In this paper, we have developed and analyzed an actor-critic procedure designed for finding near-optimal policies in the offline setting.
Researcher Affiliation | Academia | Andrea Zanette, University of California, Berkeley; Martin J. Wainwright, University of California, Berkeley; Emma Brunskill, Stanford University
Pseudocode | Yes | Algorithm 1: ACTOR (MIRROR DESCENT) ... Algorithm 2: CRITIC (PLSPE)
Open Source Code | No | Future updates of this work will be available at https://arxiv.org/abs/2108.08812
Open Datasets | No | The paper is theoretical and models data generation (Assumption 1), but it does not specify or use any publicly available datasets for empirical evaluation.
Dataset Splits | No | The paper is theoretical and includes no empirical experiments, so no training/validation/test dataset splits are provided.
Hardware Specification | No | The paper is theoretical and describes no experimental setup, so no hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, such as hyperparameters or training configurations.
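For intuition about the actor side of the algorithm: the paper's Algorithm 1 runs mirror descent on the policy, which for tabular softmax policies reduces to an exponentiated-gradient update driven by the critic's (pessimistic) action-value estimates. The sketch below is not the paper's exact procedure; it is a minimal illustration of that generic update, with the function name, array shapes, and toy numbers all chosen for the example.

```python
import numpy as np

def mirror_descent_actor_update(policy, q_values, step_size):
    """One exponentiated-gradient (mirror descent) policy update.

    policy:    (num_states, num_actions) array of action probabilities
    q_values:  (num_states, num_actions) critic estimates; in the paper's
               setting these would be pessimistic values from the critic
    step_size: mirror descent learning rate (eta)
    """
    # Multiplicative-weights form: pi_{t+1}(a|s) ∝ pi_t(a|s) * exp(eta * Q_t(s, a))
    logits = np.log(policy) + step_size * q_values
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    new_policy = np.exp(logits)
    new_policy /= new_policy.sum(axis=1, keepdims=True)  # renormalize per state
    return new_policy

# Toy usage: 2 states, 3 actions, starting from the uniform policy.
policy = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
new_policy = mirror_descent_actor_update(policy, q, step_size=1.0)
```

Each update shifts probability mass toward actions with higher estimated value while staying close (in KL divergence) to the previous policy, which is what yields the incremental, stable policy sequence the analysis relies on.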