Actor-critic is implicitly biased towards high entropy optimal policies

Authors: Yuzheng Hu, Ziwei Ji, Matus Telgarsky

ICLR 2022

Reproducibility variables (result and LLM response):
Research Type: Theoretical. This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Assumption 1.3) with a finite but non-tabular state space (cf. Assumption 1.1) finds an ϵ-optimal policy in poly(1/ϵ) samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Theorem 1.4). The algorithm and analysis avoid both via an implicit bias towards high-entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball around the maximum-entropy optimal policy, which first ensures implicit exploration and second ensures fast mixing. In more detail: 1. Actor analysis via mirror descent. ... 2. Critic analysis via projection-free sampling tools within KL balls.
Researcher Affiliation: Academia. Yuzheng Hu, Ziwei Ji, Matus Telgarsky; University of Illinois, Urbana-Champaign ({yh46,ziweiji2,mjt}@illinois.edu).
Pseudocode: Yes. Algorithm 1, "Single-trajectory linear actor-critic."
Open Source Code: No. The paper does not include a statement about releasing code for the methodology, nor a link to a code repository.
Open Datasets: No. The paper is theoretical and does not use or refer to specific public datasets with access information. The phrase "trained on a single trajectory" refers to the conceptual operation of the algorithm, not to empirical training on a dataset.
Dataset Splits: No. The paper is theoretical and does not involve empirical validation, so it provides no training/validation/test splits.
Hardware Specification: No. The paper is theoretical and focuses on mathematical analysis of an algorithm. It does not mention any specific hardware used for running experiments.
Software Dependencies: No. The paper is theoretical and does not describe an implementation with specific software dependencies or version numbers.
Experiment Setup: No. The paper is theoretical and does not describe an empirical experiment setup. Although Algorithm 1 lists parameters such as "actor iterations t and step size θ; critic iterations N and step size η", these govern the theoretical convergence analysis rather than a concrete experimental configuration.
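The actor-critic structure summarized in the Research Type row above (a mirror-descent actor driven by a sampled critic) can be sketched in a toy setting. Everything below is an illustrative assumption, not the paper's Algorithm 1: the 2-state, 2-action MDP, the one-hot features (a tabular special case of a linear MDP), the SARSA-style TD(0) critic, and the step sizes are all invented for demonstration. The only structural points it shares with the description above are a softmax-parameterized policy, a critic estimated from a single sampled trajectory, and an actor update that is mirror descent with KL divergence on the simplex, which reduces to an exponentiated-gradient update pi <- pi * exp(step * Q_hat).

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9

# Toy MDP (an assumption for illustration): action 0 always pays reward 1,
# action 1 pays 0; transitions are uniform over states.
R = np.array([[1.0, 0.0], [1.0, 0.0]])   # R[s, a]
P = np.full((nS, nA, nS), 1.0 / nS)      # P[s, a, s']

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phi(s, a):
    """One-hot state-action features: a tabular special case of a linear MDP."""
    v = np.zeros(nS * nA)
    v[s * nA + a] = 1.0
    return v

def td_critic(pi, n_steps=4000, eta=0.1):
    """Linear TD(0) (SARSA-style) estimate of Q^pi from one sampled trajectory."""
    w = np.zeros(nS * nA)
    s = int(rng.integers(nS))
    a = rng.choice(nA, p=pi[s])
    for _ in range(n_steps):
        s2 = rng.choice(nS, p=P[s, a])
        a2 = rng.choice(nA, p=pi[s2])
        f, f2 = phi(s, a), phi(s2, a2)
        w += eta * (R[s, a] + gamma * (w @ f2) - (w @ f)) * f  # TD error step
        s, a = s2, a2
    return w.reshape(nS, nA)  # Q_hat[s, a]

# Actor: mirror descent with KL on the simplex = exponentiated gradient,
# i.e. adding step * Q_hat to the softmax logits.
theta = np.zeros((nS, nA))
for _ in range(5):                     # 5 actor iterations (an assumption)
    Q_hat = td_critic(softmax(theta))
    theta = theta + 1.0 * Q_hat        # actor step size 1.0 (an assumption)

pi = softmax(theta)
print(pi[:, 0])  # probability of the rewarding action in each state (close to 1)
```

The softmax logits never hit the boundary of the simplex, so no projection is needed; this mirrors, in a very loose way, the projection-free flavor of the analysis described above.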