Actor-critic is implicitly biased towards high entropy optimal policies
Authors: Yuzheng Hu, Ziwei Ji, Matus Telgarsky
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Assumption 1.3) with a finite but non-tabular state space (cf. Assumption 1.1) finds an ϵ-optimal policy in poly(1/ϵ) samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Theorem 1.4). The algorithm and analysis avoid both via an implicit bias towards high entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball of the maximum entropy optimal policy, and this firstly ensures implicit exploration, and secondly ensures fast mixing. In more detail: 1. Actor analysis via mirror descent. ... 2. Critic analysis via projection-free sampling tools within KL balls. |
| Researcher Affiliation | Academia | Yuzheng Hu, Ziwei Ji, Matus Telgarsky University of Illinois, Urbana-Champaign <{yh46,ziweiji2,mjt}@illinois.edu> |
| Pseudocode | Yes | Algorithm 1 Single-trajectory linear actor-critic. |
| Open Source Code | No | The paper does not include a statement about releasing code for the methodology or a link to a code repository. |
| Open Datasets | No | The paper is theoretical and does not use or refer to specific public datasets with access information. The phrase 'trained on a single trajectory' refers to the conceptual operation of the algorithm, not empirical training on a dataset. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical validation or dataset splits. Therefore, it does not provide training/test/validation dataset splits. |
| Hardware Specification | No | The paper is theoretical and focuses on mathematical analysis of an algorithm. It does not mention any specific hardware used for running experiments. |
| Software Dependencies | No | The paper is theoretical and does not describe an implementation with specific software dependencies or version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup for empirical testing. While Algorithm 1 lists parameters like 'actor iterations t and step size θ; critic iterations N and step size η', these are parameters for the theoretical analysis of the algorithm's convergence properties, not concrete experimental settings for a physical setup. |