Actor-critic is implicitly biased towards high entropy optimal policies

Authors: Yuzheng Hu, Ziwei Ji, Matus Telgarsky

ICLR 2022

Reproducibility variables (result and LLM response):
Research Type: Theoretical. This work shows that a simple linear actor-critic (cf. Algorithm 1) in a linear MDP (cf. Assumption 1.3) with a finite but non-tabular state space (cf. Assumption 1.1) finds an ϵ-optimal policy in poly(1/ϵ) samples, without any explicit exploration or projections in the algorithm and without any uniform mixing assumptions on the policy space (cf. Theorem 1.4). The algorithm and analysis avoid both via an implicit bias towards high-entropy policies: the actor-critic policy path never leaves a Kullback-Leibler (KL) divergence ball around the maximum-entropy optimal policy, which first ensures implicit exploration and second ensures fast mixing. In more detail: 1. Actor analysis via mirror descent. ... 2. Critic analysis via projection-free sampling tools within KL balls.
Researcher Affiliation: Academia. Yuzheng Hu, Ziwei Ji, Matus Telgarsky; University of Illinois, Urbana-Champaign ({yh46,ziweiji2,mjt}@illinois.edu).
Pseudocode: Yes. Algorithm 1, "Single-trajectory linear actor-critic."
Open Source Code: No. The paper does not include a statement about releasing code for the methodology, nor a link to a code repository.
Open Datasets: No. The paper is theoretical and does not use or refer to specific public datasets with access information. The phrase "trained on a single trajectory" refers to the conceptual operation of the algorithm, not to empirical training on a dataset.
Dataset Splits: No. The paper is theoretical and does not involve empirical validation, so it provides no training/validation/test splits.
Hardware Specification: No. The paper is theoretical and focuses on mathematical analysis of an algorithm. It does not mention any specific hardware used for running experiments.
Software Dependencies: No. The paper is theoretical and does not describe an implementation with specific software dependencies or version numbers.
Experiment Setup: No. The paper is theoretical and does not describe an empirical experiment setup. Although Algorithm 1 lists parameters such as "actor iterations t and step size θ; critic iterations N and step size η", these govern the theoretical convergence analysis rather than a concrete experimental configuration.
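The actor-critic structure summarized in the Research Type row above (a mirror-descent actor driven by a sampled critic) can be sketched in a toy setting. Everything below is an illustrative assumption, not the paper's Algorithm 1: the 2-state, 2-action MDP, the one-hot features (a tabular special case of a linear MDP), the SARSA-style TD(0) critic, and the step sizes are all invented for demonstration. The only structural points it shares with the description above are a softmax-parameterized policy, a critic estimated from a single sampled trajectory, and an actor update that is mirror descent with KL divergence on the simplex, which reduces to an exponentiated-gradient update pi <- pi * exp(step * Q_hat).

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9

# Toy MDP (an assumption for illustration): action 0 always pays reward 1,
# action 1 pays 0; transitions are uniform over states.
R = np.array([[1.0, 0.0], [1.0, 0.0]])   # R[s, a]
P = np.full((nS, nA, nS), 1.0 / nS)      # P[s, a, s']

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phi(s, a):
    """One-hot state-action features: a tabular special case of a linear MDP."""
    v = np.zeros(nS * nA)
    v[s * nA + a] = 1.0
    return v

def td_critic(pi, n_steps=4000, eta=0.1):
    """Linear TD(0) (SARSA-style) estimate of Q^pi from one sampled trajectory."""
    w = np.zeros(nS * nA)
    s = int(rng.integers(nS))
    a = rng.choice(nA, p=pi[s])
    for _ in range(n_steps):
        s2 = rng.choice(nS, p=P[s, a])
        a2 = rng.choice(nA, p=pi[s2])
        f, f2 = phi(s, a), phi(s2, a2)
        w += eta * (R[s, a] + gamma * (w @ f2) - (w @ f)) * f  # TD error step
        s, a = s2, a2
    return w.reshape(nS, nA)  # Q_hat[s, a]

# Actor: mirror descent with KL on the simplex = exponentiated gradient,
# i.e. adding step * Q_hat to the softmax logits.
theta = np.zeros((nS, nA))
for _ in range(5):                     # 5 actor iterations (an assumption)
    Q_hat = td_critic(softmax(theta))
    theta = theta + 1.0 * Q_hat        # actor step size 1.0 (an assumption)

pi = softmax(theta)
print(pi[:, 0])  # probability of the rewarding action in each state (close to 1)
```

The softmax logits never hit the boundary of the simplex, so no projection is needed; this mirrors, in a very loose way, the projection-free flavor of the analysis described above.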