Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors. In this paper, we have developed and analyzed an actor-critic procedure designed for finding near-optimal policies in the offline setting. |
| Researcher Affiliation | Academia | Andrea Zanette University of California, Berkeley EMAIL Martin J. Wainwright University of California, Berkeley EMAIL Emma Brunskill Stanford University EMAIL |
| Pseudocode | Yes | Algorithm 1 ACTOR (MIRROR DESCENT) ... Algorithm 2 CRITIC (PLSPE) |
| Open Source Code | No | Future updates of this work will be available at https://arxiv.org/abs/2108.08812 |
| Open Datasets | No | The paper is theoretical and models data generation (Assumption 1), but it does not specify or use any publicly available datasets for empirical evaluation. |
| Dataset Splits | No | The paper is theoretical and does not include empirical experiments; therefore, no training/validation/test dataset splits are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental setup, thus no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any empirical experimental setup, including specific hyperparameters or training configurations. |